Identify weird sequences

jmay29 commented 7 years ago

So, I was fiddling around with some code to identify sequences with vastly different numbers of gaps as Sally had suggested in another issue. I got some code working but I was wondering what would be a good "range" of gappiness? I was thinking something like this:

This will give the number of positions where internal gaps are found for each sequence. internalGaps <- sapply(regmatches(dfInitial$nucleotides, gregexpr("[-+]", dfInitial$nucleotides)), length)

Add it to the dataframe so you can get a sense of which sequences are weird. dfInitial$numberOfGaps <- internalGaps

Mean (or median) number of gaps? meanGap <- mean(internalGaps)

Upper range of gappiness extremeHighGap <- meanGap + 7

Lower range of gappiness
extremeLowGap <- meanGap - 7

We then loop through each sequence to see if the number of gaps deviates greatly from the mean. Any sequences with gaps greater than extremeHighGap or lower than extremeLowGap will be identified. extremeSeqs <- foreach(i=1:nrow(dfInitial)) %do% which(internalGaps[[i]] > extremeHighGap | internalGaps[[i]] < extremeLowGap) extremeBins <- which(extremeSeqs > 0)

Subset out these extreme gappy/ungappy sequences so you can look at them. dfExtreme <- dfInitial[extremeBins, ]

If you want to remove them from the original dataframe. dfNormal <- dfInitial[-extremeBins, ]

jmay29 commented 7 years ago

By the way, I was testing this on aligned, trimmed sequences! (The final alignment)

m-orton commented 7 years ago

Thanks Jacqueline, this looks like it could be really useful for identifying/eliminating some of the weirder sequences we are encountering. Not sure on what would be a good number for the gap interval, may need Sally's input.

sadamowi commented 7 years ago

Hi Jacqueline,

This looks like it would be really useful. Thanks for this.

I'm not sure the "extreme low gap" function will be needed very often, in the context of when aligning against a reference sequence. Typically, one would expect no to very few gaps of sequences against a validated reference sequence in the same taxonomic group.

It would be good if this can be a user-defined parameter (i.e. are you planning to make this into a function?) as to what is the definition of extreme gappiness.

What would constitute an "extreme" level of gaps would depend upon the taxon. For example, in Lepidoptera any internal gap would be of interest. Any gap at all could indicate a lower-quality sequence that is missing data. However, sequences with just a single gap are likely the result of a base calling error, and inclusion of such sequences would have a minimal impact upon the final alignment and results.

So, I guess here, would the point be to identify all possibly lower-quality sequences (that contain some sort of error, such as dropping a nucleotide) OR to focus upon pinpointing the true outliers that could either be major sequencing errors influencing the alignment or that could be biologically very different yet real sequences (e.g. contaminant from another taxonomic group or a highly unusual sequence for a given taxon)?

I would think that for our purposes for now that identifying highly unusual sequences would be our goal. So, I would think that for most taxonomic groups, 7 or more gapped positions would be good benchmark. This would mean more than 2 amino acid deletions (if the sequence is correct).

I also suggest to export the reference sequence into the df with the unusual sequences, for inspection. I suggest also providing an option to export the FASTA for more detailed exploration.

However, I am also wondering whether this might best come into play in a "version 2" of this pipeline as well as being a function in Jacqueline's thesis and planned R package. Version 2 could be used in Jacqueline's thesis work as well as possibly in the final latitude publication, if suitable. (e.g. If we have other revisions to make, requested by the reviewers, we could incorporate additional such improvements to the pipeline.)

Matt has already made the revised scripts for running the final analysis for all taxa for the latitude analysis based upon the sister pair approach. I believe he is manually inspecting the alignments and removing problematic sequences.

What do you think, Matt? Does that make sense to incorporate Jacqueline's new code into a V2? Perhaps, when suitable, you could create a separate master for V2? Or, what makes the most sense for keeping track of the V2? Should that be a new branch of the master? It would be good if we can organize this so that Jacqueline can work on her ideas and new code towards the V2 (and/or R package?) but that we can also move forward on our short-term goal of submitting the initial manuscript as soon as possible.

Thoughts?

Cheers, Sally

sadamowi commented 7 years ago

PS. Correction. I can see that the extremely low-gapped sequences are also interesting. For example, if there are sequences in an alignment with many insertions, they may have no or few gaps in comparison with the reference sequence. So, please ignore that component of my comment above. Both high and low levels of gaps are of interest.

sadamowi commented 7 years ago

Hi Matt,

I wanted to let you know that Jacqueline and I just had a Skype conversation this afternoon. She is going to try the above tool, adjusted to detect sequences with 7+ gapped positions more or fewer than the mean, as a tool to identify unusual sequences.

We also discussed the goal of submitting this version of the sister pipeline, run on latitude, as soon as feasible. If you'd like to incorporate the gap checker, you are welcome to do so, or if you prefer to continue with the manual inspection, that's fine too.

I suggest that you might consider creating a thread with "Ideas to consider for V2". You could paste there the ideas you aren't implementing right now but which you may be interested in implementing for V2.

Jacqueline will also soon convey some findings that she has about adjusting the code to make it run faster. Jacqueline plans to continue to work off Github for the short term.

So, does this sound like a good plan, i.e. to continue to use the issues for Jacqueline to report new sections of code she is creating that you might consider for incorporation for the latitude project?

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

Sounds good, there are a few minor improvements I was thinking about that could be made in a V2 of the pipeline so I could make a thread for V2 improvements/ideas and Jacqueline could post ideas/new code sections there as well. In response to the earlier question about making V2 a separate version or branch, I think giving V2 a separate master branch is probably the best way of doing things.

Depending on the taxa, I may implement the above tool if I'm finding really gappy alignments. If its just the case of a couple of bins causing a gappy alignment, then I will probably just resort to manually removing them.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

This sounds like a good solution to me. That's great that we all seem to be on the same page about this and also have a mechanism for sharing ideas and new sections of code.

Cheers, Sally

jmay29 commented 7 years ago

Hi Sally and Matt - this sounds good to me, too. I will keep on experimenting with this "gappyness" function and keep you updated!

sadamowi commented 7 years ago

Hi Jacqueline,

That sounds great, both for your thesis and for consideration for V2 for this project. I made a note in the "V2" thread that you are working on this. Please do post to the V2 as you make progress in testing your code above or have a revised section of code for Matt to consider for incorporation into V2.

I do think that your above approach of +7 or -7 gapped positions makes a lot of sense as the default. And, I like the idea of outputting the highly gapped sequences for inspection. I think a higher threshold would be preferable if you wanted to automatically eject sequences on the basis of this, especially for certain phyla (e.g. Mollusca). It is possible that your current threshold would be good for your fish.

I will close this issue as a separate thread in favour of moving towards resolution of all outstanding issues relating to V1. The V2 thread will stay open.

Best wishes, Sally

m-orton / Evolutionary-Rates-Analysis-Pipeline

Identify weird sequences #40