roblanf / SRHtests

tests of stationarity, reversibility, and homogeneity for sequence alignments

Random partitions #3

Open suhanaser opened 6 years ago

suhanaser commented 6 years ago

Hi @roblanf. I modified the SRH.py code so it can generate a new random directory with 20 random partitions for each dataset for the MPTS test. Should I update the existing SRH.py code, or do you think it is better to create a separate function in a new file?

roblanf commented 6 years ago

I'd make a new file, something like SRH_randomised.py. The reason is that I think this will be easier for others (and for us, in the future) to follow. It also means you can keep the rest of the repository the same and just add a step to the end, i.e. "Step x: run SRH_randomised.py to generate 20 randomised selections of partitions for each dataset".

suhanaser commented 6 years ago

SRH_randomised.py will actually generate everything plus the random partitions; is that OK? Right now I'm using variables and functions from the original SRH.py to create these partitions. But if you mean that SRH_randomised.py should only generate the random partitions, then I'll have to write a new, separate script.

roblanf commented 6 years ago

My suggestion is to keep them separate, because it's clearer for others and for us to re-run. Also, most empirical users will ONLY want to run the SRH.py analyses on their data, not the randomisations. The current SRH.py code is locked down and we have run it to generate the current output.

I'd make a file that reads in some of our existing output to get the number of subsets per dataset and per test, and then generates the 20 randomised files.

However, you should do this how you see fit. As long as the code works, is well enough commented, and is public, it's all good!


--
Rob Lanfear
Division of Ecology and Evolution, Research School of Biology, The Australian National University, Canberra

www.robertlanfear.com

suhanaser commented 6 years ago

For some reason there are only 3 datasets on the server, and I don't have permission to copy the rest. Can you please do that?

roblanf commented 6 years ago

Can't you get them all directly from this repo?

Is the issue that you're not able to write files to the directory? If so, that's definitely fixable!

R


suhanaser commented 6 years ago

Yeah, I just can't write files or folders to the directory

roblanf commented 6 years ago

Fixed.


suhanaser commented 6 years ago

Thanks! It works

suhanaser commented 6 years ago

Maybe it's a stupid question, but why should we generate 20 random subsets for both the bad and the good partitions if we only want to check the all-good relationship? Isn't it enough to just look at the good partitions?

roblanf commented 6 years ago

Two things:

  1. I wouldn't think of it as 20 random subsets of N loci (where N is the number of 'bad' loci in the original analysis). I'd think of it as twenty random partitions of the data into N and X-N loci (where X is the total number of loci). All we're doing is holding N constant, but instead of dividing the dataset based on e.g. MPTS scores, we're now dividing it at random.

  2. I don't think we should just check the 'all-good' relationship, but all three relationships. I'm open to discussing this, though: thinking about it, I do wonder whether it would be better to focus on just that relationship, because it's the one that is empirically most important.
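The scheme in point 1 (hold N fixed, re-draw the split at random) can be sketched as follows. This is an illustrative sketch, not the actual SRH.py code: the function name and the use of `random.sample` are my own choices here.

```python
import random

def random_partitions(loci, n_bad, n_reps=20, seed=None):
    """Draw n_reps random splits of `loci` into a pseudo-'bad' set of size
    n_bad and its complement: the same sizes as the original good/bad split,
    but with loci assigned at random instead of by MPTS score."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_reps):
        bad = set(rng.sample(loci, n_bad))                    # random 'bad' set, size N
        good = [locus for locus in loci if locus not in bad]  # complement, size X - N
        splits.append((sorted(bad), good))
    return splits

# Example: 10 loci, 4 of which were 'bad' in the original analysis
splits = random_partitions([f"locus{i}" for i in range(10)], n_bad=4, seed=1)
print(len(splits))                           # 20 replicates
print(len(splits[0][0]), len(splits[0][1]))  # 4 6
```

Seeding the generator makes the 20 random draws reproducible, which matters if the randomised selections are committed to the repository.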


suhanaser commented 6 years ago
  1. There are two possibilities; the first is technically harder than the second. (I) N and X-N are complementary, as in the original situation. In this case I have to find the complementary subset for each partition, but because these partitions are randomly generated, it is not trivial to work out which partitions are complementary.
    (II) N and X-N are independent (you can call them N and M, where N+M=X). The only problem with this option is that some partitions may be counted twice while others are excluded.

  2. Computationally it is definitely easier to focus only on the all-good relationship, and then we don't even have to think about (1). Empirically, we know the good-bad relationship has no meaning on its own, because bad is the absolute complement of good. But we also know that all and bad are not significantly different (Fig. 5), so maybe that is exactly what makes it important to show too.

roblanf commented 6 years ago

I think we're tying ourselves in knots here.

On point 1.

We have a set X of all the loci. For each of the 20 iterations, you have determined a set N of loci, where the number of loci in N is the same as the number of 'bad' loci from the original analysis. All I'm suggesting is that you then write down the set X-N=K, which will just be all the loci in X that are not in N. The set K will, of course, contain the same number of loci as the 'good' set from the original analysis. I feel like I haven't understood something, because this should be simple to do. My suggestion is then to run IQTREE on X, N, and K, and to recalculate the wSH test for X vs. N, X vs. K, and K vs N.

On point 2: if you have a very convincing argument for only focussing on the all vs. good comparison, then that's OK. But if you don't, I think it would be prudent to do this for all three of the comparisons. You mention one reason to do this. I also think that all vs. bad is not meaningless - indeed, I think that most people would expect that in a perfect world, dividing a dataset into two subsets should give you fairly similar trees most of the time. The fact that we often find the 'good' and 'bad' trees are significantly different is meaningful in that sense. The randomisations will help us gauge just how meaningful it is w.r.t. the model violations we are testing for.
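The set arithmetic in point 1 really is just a set difference. A minimal sketch, with the IQ-TREE runs and wSH comparisons indicated only by name (the actual invocations live in the pipeline scripts):

```python
def complement(all_loci, n_set):
    """K = X - N: every locus in X that is not in the randomly drawn set N."""
    return sorted(set(all_loci) - set(n_set))

X = [f"locus{i}" for i in range(8)]  # all loci
N = ["locus1", "locus4", "locus6"]   # one random draw, same size as the original 'bad' set
K = complement(X, N)                 # same size as the original 'good' set

# The three tree comparisons to recalculate per replicate:
comparisons = [("X", "N"), ("X", "K"), ("K", "N")]
print(K)  # ['locus0', 'locus2', 'locus3', 'locus5', 'locus7']
```

By construction K always has the same size as the original 'good' set, so replicate splits stay directly comparable to the original analysis.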


suhanaser commented 6 years ago
  1. I will do that; my point was just that for X-N I'll have to add a new function, which is not a big deal.
  2. We already know that they are significantly different, and I think that if we can show the result is not random for one relationship (say all-good), that should be enough. But since I can't really prove this, I will just show all three relationships.
suhanaser commented 6 years ago

I'm trying to run the random partitions in parallel but I get this message: "Academic tradition requires you to cite works you base your article on. If you use programs that use GNU Parallel to process data for an article in a scientific publication, please cite:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing."

What does this even mean? I remember now that this was the problem that stopped me from using parallel before.

roblanf commented 6 years ago

This is just a message that's asking you to cite that paper when you write up your methods. So you should do that.

It doesn't mean parallel isn't working - it will be working fine!

R


suhanaser commented 6 years ago

I want to run IQ-TREE on all the random folders. Do you want to check the scripts before I do that? I have already tested them on 3 small datasets and everything was fine.
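A dry-run sanity check is one way to let someone review the runs before launching them: build the command list without executing anything. The directory layout, file extensions, and binary name below are placeholders, not the pipeline's actual conventions.

```python
import os

def iqtree_commands(random_dir, iqtree_bin="iqtree"):
    """Build (but do not run) one IQ-TREE command per alignment found under
    the randomised-partition folders. Extensions and flags are illustrative;
    adjust to whatever the pipeline actually writes out."""
    cmds = []
    for root, _dirs, files in os.walk(random_dir):
        for name in sorted(files):
            if name.endswith((".nex", ".phy", ".fasta")):
                aln = os.path.join(root, name)
                cmds.append([iqtree_bin, "-s", aln])
    return cmds
```

Printing the returned list (or handing each command to GNU Parallel) makes it easy to eyeball exactly what will run on the full set of random folders before committing the compute time.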