matsengrp / ecgtheow

Ancestral lineage reconstruction using BEAST or RevBayes
2 stars 2 forks source link

health metrics agreement #13

Closed lauradoepker closed 6 years ago

lauradoepker commented 6 years ago

Ahh, I'm freaking out a little this morning because I'm worried that the "health metrics" (healthy_to_fasta.py) we apply in the ecgtheow run_beast_asr.sh script are not the same as the health metrics we apply in the CFT pipeline.

Stop codons Last word in CFT is that I maintained that 1 STOP codon could be allowed, but in the ecgtheow project, I determined that I wanted no STOP codons allowed. I forgot that ecgtheow was operating independently. @metasoarous and @dunleavy005 we need to make sure we're on the same page here -- and maybe using/updating the same scripts?

Indel reversal I can't tell for sure, but I think we're still using indel reversal in ecgtheow healthy_to_fasta.py? I'm not sure what the latest decision was regarding indels @matsen ? I remember they messed up some trees but that we weren't sure if we wanted to ignore them quite yet. I was in favor of ignoring them completely at one point, but that was more of a "want" not a logical/conservative decision.

dunleavy005 commented 6 years ago

Stop codons From my side, there's only two choices: include or not include stop codons. That partis flag is binary, so Chris must be using different information if he's flagging 1 stop codon.

Indel reversal Yes, we're using indel-reversed sequences (I assumed b/c of alignment simplicity).

metasoarous commented 6 years ago

No, we switched to not allowing any stop codons. It was 1 stop codon allowed for an earlier version of the methods. It's still the case that I do this processing post-partis, though it would be convenient in various ways if they were filtered out before I see them.

matsen commented 6 years ago

Here's what shows that we're using indel-reversed sequences: https://github.com/matsen/pandis/blob/14866153333c3bddc597f55fd90be5cc31d23fdc/healthy_to_fasta.py

lauradoepker commented 6 years ago

Okay, so we have two options here:

1) Keep CFT and ecgtheow scripts/pipelines as they are and continue independently performing health screening on partis output. Laura would be responsible for keeping track of and communicating any changes to this process to the other team (e.g. if we went back to allowing 1 stop codon... though I expect us to change anything in the near future).

OR

2) @metasoarous provides the portion of script that he uses to perform health screening and @dunleavy005 links this in to ecgtheow. This would require that @metasoarous 's script has an output file (.fasta) of healthy sequences that would be usable for both CFT and ecgtheow. If this is the case, we would just use that common output file as our input to both workflows and make sure that we always update (git submodule update for ecgtheow) before working.

@metasoarous thoughts? I want to AVOID making more work, since option 1 is okay too: I can just monitor the situation. However, option 2 is preferable because then we have no doubt we are working with the exact same healthy sequences between projects.

metasoarous commented 6 years ago

Right now the part of the code that does this is baked deep within our process_partis.py script. That spits out healthy sequences as a fasta file, but also does a bunch of other processing to spit out json files, and other things. At some point there was talk about making that a more generally useful thing outside of cft (I think it may have been getting used as a standalone script by someone), but we haven't really had our eyes on this for a while. I'd be happy to push on this end so that everything is using the same script.

matsen commented 6 years ago

Seems like there are two options here, no? One is to isolate that script, and another is to have cft make a fasta along the way that has the sequences it will use downstream.

lauradoepker commented 6 years ago

^I like the "CFT spits out a healthy fasta" option because then that healthy fasta can be used as input to ecgtheow straight away. This cuts down on re-computing the same thing.

metasoarous commented 6 years ago

CFT already does this, so... issue closed?

The other crazy option is that we just plug ecgtheow directly into cft so you can flag partitions (via the input yaml files) that you want to run ecgtheow on, and it'll do it to it. The way I've reorganized the interface around reconstructions will make this relatively easy to shoehorn into the current reconstruction flow. Thoughts?

matsen commented 6 years ago

Issue's closed as soon as you point @lauranoges at where this file is-- or is it already the download link?

lauradoepker commented 6 years ago

@metasoarous I still want to tinker with the ecgtheow settings (# iterations, etc) so let's not plug it into CFT quite yet. Yes, please point me to the healthy output fasta files.

dunleavy005 commented 6 years ago

I'm closing this via https://github.com/matsengrp/ecgtheow/commit/0817b981e4b8d35997613fc49dbfb9c50c61ab5b