m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large-scale phylogenetic comparisons of various taxa found in the Barcode of Life Database (BOLD), retrieved via its API.
GNU General Public License v3.0

Question about filtering by sequence length #29

Closed sadamowi closed 7 years ago

sadamowi commented 7 years ago

Hi Matt,

I am running Annelida again to further explore and fine-tune the alignment parameters, specifically trying a gapopen penalty of -3000 in case that might be a generally useful parameter for us.

In the Clitellata alignment, I notice that there are some sequences shorter than 620 bp, which have terminal gaps. I am wondering if the length filter (lines 277-279) counts all characters or just nucleotides? Would you mind having a look?

I suggest we would want sequences of at least 620 bp in length. If we implement this filter, then the sequences will usually be longer than the region we retain, which is good, and we'd then be trimming down; i.e. the sequence read would have run right through the end of the portion we are using, yielding higher-quality data. I noticed that a few of the shorter sequences have an unusual amino acid at the end, suggesting poorer sequence quality. For example, a base may have been added or dropped, producing a reading-frame shift. That is difficult to detect, and difficult for a gap to correct, when it occurs very near the end of the sequence. Therefore, I'd like to suggest that all sequences be at least 620 bp in length, counting only nucleotides.

Thanks very much for having a look.

Cheers, Sally

sadamowi commented 7 years ago

PS. I didn't have that quite right. The sequence wasn't necessarily shorter than 620 bp; rather, part of its length was at the front, leaving a number of gaps at the end after trimming using the reference.

One idea I have to address this would be to make the length filter 640 bp (instead of 620) but still trim to 620. That way, we would hopefully be trimming to higher-quality sequence most of the time, rather than using the last few nucleotides of a sequence that is declining in quality towards its end.

What do you think?

Thanks for any thoughts on that.

m-orton commented 7 years ago

Hi Sally,

Thanks for your suggestion, I agree on changing the filter to 640 bp. I can make the changes to the branches.

The length filter counts all characters, but I could find a way of counting only nucleotides if that works better. I'm guessing the Biostrings package probably has a function for easily counting nucleotides; I'll look into it.

Best Regards, Matt

sadamowi commented 7 years ago

Thank you very much Matt.

Line 258 (Annelida branch) uses "length" for calculating the proportion of internal gaps. Could that be used? I think the most important thing is excluding the external gaps for counting the sequence length. I'm not so concerned about internal gaps, as those are few.
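The distinction between terminal and internal gaps can be sketched in base R: strip the leading and trailing gap runs first, then count what remains. The sequence below is a toy example, not taken from the pipeline:

```r
# A trimmed alignment row: terminal gaps at both ends, one internal gap.
seq <- "---ATGC-GTACGT--"

# Remove only the leading and trailing runs of gaps.
core <- gsub("^-+|-+$", "", seq)          # "ATGC-GTACGT"

# Proportion of internal gaps within the aligned core.
internalGaps <- nchar(gsub("[^-]", "", core))
internalGaps / nchar(core)                # 1 gap out of 11 positions
```

Counting `nchar(core)` after this stripping would give the sequence length without penalizing terminal gaps, while internal gaps could still be assessed separately.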

Again, the few sequences that I looked at relating to this issue were short on the right side but could have been 620 bp in total. I am having a run at Annelida now with the 640 bp filter to see if that helps.

Cheers, Sally

m-orton commented 7 years ago

Thanks Sally, you're right about the command at line 258. I should be able to reuse that command to determine the sequence length.

sadamowi commented 7 years ago

Hi Matt,

OK - thank you very much for trying that.

Cheers, Sally

m-orton commented 7 years ago

Think I came up with an easy fix for determining the correct sequence length:

```r
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
# Guard against an empty index vector: dfInitial[-integer(0), ] would
# otherwise drop every row.
if (length(sequenceLengthCheck) > 0) {
  dfInitial <- dfInitial[-sequenceLengthCheck, ]
}
```

The gsub function will eliminate the gaps in any sequences that have them so that the nchar function should only count actual nucleotides. Also the minimum sequence length will now be 640. Just tested on Annelida and seems to work!
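As a quick sanity check, the gap-stripping behaviour can be verified on a toy sequence (values are illustrative, not from the pipeline):

```r
# nchar() alone counts every character, including alignment gaps.
aligned <- "ATG--CGTAC---"
nchar(aligned)                      # 13 characters in total

# Removing the gap character first leaves only nucleotides.
nchar(gsub("-", "", aligned))       # 8 nucleotides
```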

I'll modify all of the branches with the above code.

Best Regards, Matt

sadamowi commented 7 years ago

Fantastic - thanks Matt!

sadamowi commented 7 years ago

Hi Matt,

I think this looks great. Thank you for implementing this. I think the different filtering and trimming settings will be helpful for the quality of the sequences and results.

I do have a final question before we close this issue, about the specific bp choices. I suggest that 640 filtering and 620 trimming are good choices. However, my final question is whether we would lose a lot of pairs for the Arthropoda? For example, if we would lose 2% of pairs, I think that is fine. However, if we'd lose 20% of pairs, then I suggest we consider 620 filtering and 600 trimming.

I recall you did some explorations some time ago on median sequence lengths and on how pairings relate to sequence length. Do you have any prior results that could help us out here?

If not, then perhaps we could do a final consideration of this issue if things run smoothly for Arthropoda on the cloud. We could run the analysis both ways.

Thank you for any thoughts.

Best wishes, Sally

sadamowi commented 7 years ago

Hi Matt,

Sorry to add one more thing to your list on this issue. I realized just now that for Echinodermata, one quite short sequence got through to the final trimmed alignment (I am looking first at class Asteroidea; a sequence of 449 bp got through into that alignment). Perhaps this sequence was longer originally and hence passed the filter, but some of its length was in the "wrong" region.

Is it possible to add a second length-based exclusion step after completion of the final, trimmed alignment? Or, do you have a different idea for this?

For some taxa, this would have a minimal impact, as probably there are very few sequences that are long but not in the "correct" genetic region for us. Most of the sequences in the database will have been generated using primers that bind in the standard region. So, if this is tricky to fix, I think it is OK and we can manually check against that, specifically for Mollusca.

However, if this can be fixed without too much trouble, I suggest this would be helpful. As you will see in another thread, I am going to recommend complete deletion rather than pairwise deletion for calculation of Mollusca genetic distances. So, allowing shorter sequences through could mess up the result. However, we can manually check for this scenario for Mollusca specifically if we don't want to mess with the code further at this late stage. But if the code can readily be made more general, this would save us time if running the code again in the future, e.g. if we are finetuning additional parameters based upon new discoveries about the data or results or if we are running things again after checking for updated sequence data after review, etc.
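A minimal sketch of the second exclusion step Sally describes, applied after the final trim. The data frame `dfTrimmed` and its `nucleotides` column are assumed names, and the toy sequences are illustrative:

```r
# Toy stand-in for the final trimmed alignment data frame.
dfTrimmed <- data.frame(
  nucleotides = c(strrep("A", 620),                             # full-length
                  paste0(strrep("A", 400), strrep("-", 220))),  # short, gap-padded
  stringsAsFactors = FALSE
)

# Count nucleotides only (gaps excluded) and drop sequences that
# fall below the 620 bp threshold after the final trim.
trimmedLengths <- nchar(gsub("-", "", dfTrimmed$nucleotides))
tooShort <- which(trimmedLengths < 620)
if (length(tooShort) > 0) {
  dfTrimmed <- dfTrimmed[-tooShort, ]
}
nrow(dfTrimmed)   # 1 sequence remains
```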

Thank you for considering this issue.

Cheers, Sally

sadamowi commented 7 years ago

PS. Or, perhaps there is a better way, and the reference sequence could be included in alignment 2 and the length filtering done after that, prior to the final alignment? Any different ideas welcome!

sadamowi commented 7 years ago

PPS. Pardon me for this long chain! I should have looked at all of the echinoderm classes before posting. The diversity of genetic regions within COI and the diversity of primers used appears to be much higher for Echinodermata than for the other phyla (based upon info for four phyla: I have run Annelida, Cnidaria, and Echinodermata, and I looked at your Mollusca results).

We would lose a lot of data for echinoderms if we implement the solution that I proposed above: deleting sequences based on length a final time after trimming using the reference. I wanted to post now so that you don't work on this issue now. I am exploring other options for dealing with this... such as using partial deletion at the stage of calculating the distance matrix.

I will update again soon on this issue!

sadamowi commented 7 years ago

OK - So, unfortunately, I do not see an option for partial deletion in the distance calculator we are using. (This is an option in MEGA.) I was going to recommend partial deletion for Echinodermata with a setting such as 0.60 (i.e. positions with less than 60% data coverage would be deleted at the time of calculating the distance matrix).
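Partial deletion, as implemented in MEGA, is essentially a per-column coverage filter. A minimal approximation on a character matrix of aligned sequences might look like the following, using the 0.60 threshold Sally mentions; the toy sequences and matrix construction are illustrative only:

```r
# Toy alignment as a character matrix: one row per sequence,
# one column per site.
alnMat <- do.call(rbind, strsplit(c("ATG-C",
                                    "A-G-C",
                                    "A-GAC"), ""))

# "Partial deletion": keep only columns where at least 60% of
# sequences carry a nucleotide rather than a gap.
coverage <- colMeans(alnMat != "-")
alnMat <- alnMat[, coverage >= 0.60, drop = FALSE]
ncol(alnMat)   # 3 columns survive the 0.60 coverage filter
```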

I suggest that we leave this issue alone. For most groups, there will be very few short sequences that get through, based upon the filters we do have in place; in echinoderms there will be a few.

What I will do is compare the results for echinoderms with pairwise vs complete deletion to make sure this doesn't have a large impact upon the conclusions. For all other groups, I think this will most likely be an extremely minor issue.

So, again, pardon the long chain with all these comments. However, at least you will see this issue I was considering and you can let me know if you have any alternative view compared to the conclusion I came to here.

Also, in light of my many subsequent comments, please do see further up this chain for my question about 2% vs. 20% of Arthropoda pairs. Thank you!

m-orton commented 7 years ago

Hi Sally,

Sorry for the delay in responding (had to go and buy a turkey for Christmas).

Thanks for all of your work on this! In regards to the 2% vs. 20% of Arthropoda, I just did a few tests on the Lepidoptera initial dataframe that I had saved. First, I ran a sequence length check with the original code:

```r
sequenceLengths <- nchar(dfInitial$nucleotides)
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
```

This filtered out 26097 of 573242 sequences, or roughly 4%.

Then I tried with the new code:

```r
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 640)
```

This filtered out 65798 of 573242 sequences, or roughly 11%.

Then I tried the new method of counting nucleotides with the minimum sequence length of 620:

```r
sequenceLengths <- nchar(gsub("-", "", dfInitial$nucleotides))
sequenceLengthCheck <- which(sequenceLengths > 1000 | sequenceLengths < 620)
```

This gave only a marginal increase over the original code, to 26136 sequences filtered, which still works out to around 4% of total sequences.

So the question is whether we are willing to exclude 11% of sequences with a 640 bp minimum length.

Best Regards, Matt

m-orton commented 7 years ago

Adding to that, I can't seem to find any results for the effect this sequence-length filter would have on pairing results. I think those tests were done with an earlier version of the script, so it's hard to say what effect the minimum sequence length would have with the current iteration. Hopefully what I mentioned above can help us make a decision.

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for this summary above. I think that if 11% of the sequences are filtered out, likely the percentage of BINs that would be filtered out would be smaller. At BIO, many insects are sequenced unidirectionally, but then there is a step for seeking bidirectional sequences for a few representatives per BIN. It's just that that work is in progress, and so not all new BINs would be covered yet. Nevertheless, I predict that our % of BINs dropped would be lower than the % of sequences dropped.

So, I am OK with proceeding with the 640 bp filtering and 620 bp trimming approach. I think that overall our results will be more accurate: they will be based on longer sequences and on a higher proportion of sequences with bidirectional sequencing coverage.

Sound OK?

Best wishes, Sally

m-orton commented 7 years ago

Sounds good, I'll proceed with the 640 and 620 approach.

Best Regards, Matt

sadamowi commented 7 years ago

Great! Thanks Matt. I will close this issue. As usual, we will look out for additional issues that arise in the new analyses.

Best wishes, Sally