Closed eharkins closed 5 years ago
Yeah, except I think I've changed my mind about how to specify the indel parameters. I think maybe this is what Laura was suggesting and I was just being dense, but I think it's probably better to just say "match the indels in this sequence", i.e. specify a uid, rather than having to specify the length/pos/type of the indel.
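A minimal sketch of the "match the indels in this sequence" idea, to make the uid-based selection concrete. Everything here is an assumption for illustration: the function name, the `indels_by_uid` mapping, and the `(type, position, length)` tuples are simplified stand-ins, not the representation partis or process_partis.py actually uses.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified representation: one (type, position, length) tuple per indel.
Indel = Tuple[str, int, int]


def uids_matching_indels(indels_by_uid: Dict[str, List[Indel]], ref_uid: str) -> List[str]:
    """Return the uids whose set of indels equals that of the reference uid."""
    if ref_uid not in indels_by_uid:
        raise KeyError("reference uid %r not found in cluster" % ref_uid)
    ref_indels = set(indels_by_uid[ref_uid])
    # Compare as sets so the order in which indels are listed does not matter.
    return [uid for uid, indels in indels_by_uid.items() if set(indels) == ref_indels]


# Toy cluster: seq1 and seq2 share a deletion, seq3 has no indel, seq4 has a different one.
cluster = {
    "seq1": [("deletion", 57, 3)],
    "seq2": [("deletion", 57, 3)],
    "seq3": [],
    "seq4": [("insertion", 12, 6)],
}
matching = uids_matching_indels(cluster, "seq1")
```

With this shape the user only names a uid, and the length/pos/type are looked up rather than typed in by hand.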
@eharkins I'd like the filtered seqs outfile to be named a little more explicitly, something like indel_filtered_cluster_seqs.fa. Since all sequences in EC will be indel_rev, I'm okay with this fact not being reflected in the file name, but if we do add it (to both), it may prevent future forgetfulness on my part about indel reversal.
A few things here:
+indel or something (maybe custom? --indel-tag?); indel_rev seems to imply that the indel has been reversed in the sequence, which may or may not be the case, but is beside the matter at hand if I understand correctly.
--filter-indel-pattern-in uid or --filter-indel-pattern-out uid, riffing off @psathyrella's suggestion.
Thanks for the input here. It seems like we are going to spend a little more time thinking about how best to handle the particular indel-ed family Laura is currently dealing with; then we can generalize a solution like this if appropriate. @matsen, @lauradoepker let me know how I can be of help in determining the best way forward with that family.
@eharkins it's completely up to you to decide how generalized you write the code at this point. I want 157.Vk settled as soon as possible, but not at the cost of you having to rewrite all your code later to make it more generalizable. This issue, then, is for you and @matsen to decide.
@lauradoepker would like the ability to run ecgtheow on only the subset of sequences in a particular cluster that have a given indel (I am opening this issue on cft because the way ecgtheow processes partis output is by using cft/bin/process_partis.py). This option would come with other options to specify the indel of interest, including:
The name is up for debate; something like --only-with-particular-indel, --unique-indel, --indel-filter, etc. Going to call it --only-with-particular-indel for now:
--only-with-particular-indel, process_partis.py looks to make sure you have specified other options (see above) to define the indel you care about
input_seqs, so as to be able to make sure the indel of interest is there or not) based on containing the indel of interest by using the information from the associated options (see above) and https://github.com/psathyrella/partis/blob/dev/python/utils.py#L634. @psathyrella does this make sense?
indel_reversed_seqs sequences corresponding to the remaining IDs after filtering (we may just want to use whichever key would normally be used based on the existing --indel-reversed-seqs option, which happens to be used in ecgtheow context).
_indel_rev appended
cluster_seqs_indel_rev.fa alongside the unfiltered cluster sequences in cluster_seqs.fa (using indel_reversed)
Assuming this makes sense to everyone (cc @matsen), I will open separate issues:
--only-with-particular-indel and an indel is encountered in the specified seed sequence. The message would tell the user to use --only-with-particular-indel or specify something to ignore it like --ignore-seed-indel
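To make the proposal above concrete, here is a rough sketch of the two behaviours discussed: refusing to proceed when the seed sequence has an indel and neither flag was given, and keeping only the sequences that carry the indel of interest. It is not code from process_partis.py. The function names, the `indels_by_uid` mapping, and the `(type, position, length)` tuples are hypothetical, and the two boolean arguments simply mirror the proposed --only-with-particular-indel and --ignore-seed-indel flags.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified representation: one (type, position, length) tuple per indel.
Indel = Tuple[str, int, int]


def require_seed_indel_decision(seed_indels: List[Indel],
                                only_with_particular_indel: bool,
                                ignore_seed_indel: bool) -> None:
    """Raise if the seed has an indel and the user has not said what to do about it."""
    if seed_indels and not (only_with_particular_indel or ignore_seed_indel):
        raise ValueError(
            "seed sequence contains an indel; use --only-with-particular-indel "
            "or ignore it with --ignore-seed-indel"
        )


def keep_uids_with_indel(indels_by_uid: Dict[str, List[Indel]],
                         indel_of_interest: Indel) -> List[str]:
    """Return the uids of the sequences that contain the indel of interest."""
    return [uid for uid, indels in indels_by_uid.items() if indel_of_interest in indels]


# Toy cluster in which the seed (seq1) and seq2 share a deletion.
cluster = {
    "seq1": [("deletion", 57, 3)],
    "seq2": [("deletion", 57, 3), ("insertion", 12, 6)],
    "seq3": [],
}
require_seed_indel_decision(cluster["seq1"], only_with_particular_indel=True,
                            ignore_seed_indel=False)
kept = keep_uids_with_indel(cluster, ("deletion", 57, 3))
```

The kept uids would then be the ones written to the filtered FASTA alongside the unfiltered cluster sequences.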