m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0

Alignment of mite group Mesostigmata #32

Closed sadamowi closed 7 years ago

sadamowi commented 7 years ago

The preliminary alignment (Jan 1, 2017) of Mesostigmata contained large gaps.

Sally to check into this:

Is the beginning of the alignment correct?

Is there any information relevant for this in Young and Hebert 2015?

jmay29 commented 7 years ago

I am also checking over my fish alignment ... it looks good for the subset, but I still have to look at the alignment of the larger dataset.

sadamowi commented 7 years ago

OK great - glad to hear that so far things are working on fish too.

sadamowi commented 7 years ago

Hi Matt,

This is quite interesting. I am looking into some of the sequences that are generating those huge gaps. I have been able to find some records on BOLD that include trace files. I think that what is going on is that these are chimeric sequences, containing a mixture of a good sequence (one trace direction) and a problematic sequence (the other trace direction). The problematic sequence could be a case of contamination or non-specific amplification. However, the one case I just explored in depth seems to be a chimera between two good trace files, one of which was perhaps not reverse complemented correctly.

I am still exploring this case a little further. However, I wanted to post to this thread now so that you know not to go ahead with this order for now. We'll need to solve this issue, possibly through manual exclusion of these cases. It will be important for us to continue to check the alignments through to the end, due to these unusual errors!

Cheers, Sally

m-orton commented 7 years ago

That's a really interesting find on Mesostigmata. If it's just a few bins causing the problem, I think as you said we could probably just manually omit them from the alignment. I'll let you know if I see another alignment with this problem.

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

OK - So, I was quite wrong in my initial assessment. I was rushing as I was eager to solve the puzzle in advance of school pick-up, but I did know I needed to go through this issue again slowly and carefully.

So, after looking at this again, I think that it is possible that those are real biological sequences. (Or, they are nuclear pseudogenes that really look a lot like real sequences - i.e. no frame shifts or stop codons.) So, I explored the alignments further by first putting the sequences in reading frame and conducting an amino acid-based alignment. The alignment looked much much better when performed like that. However, there was still a huge insertion of 36 nucleotides in 5 BINs. Many of those BINs contain multiple specimens, and sometimes quite a lot of specimens. So, this isn't a case of a small number of isolated sequences exhibiting this property. Also, for one record, I dug into the trace files and found that the trace file supported the text sequence.
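As a rough illustration of the reading-frame check described above, here is a minimal Python sketch (not the pipeline's R code). It assumes the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are the stop codons; the function names are hypothetical. Note that a 36-nucleotide insertion is a multiple of three, so by itself it does not shift the reading frame.

```python
# Stop codons under the invertebrate mitochondrial code (NCBI translation table 5).
# Assumption: this is the appropriate code for mite COI sequences.
STOP_CODONS = {"TAA", "TAG"}


def stop_codons_in_frame(seq, frame):
    """Return 0-based nucleotide positions of stop codons when reading
    the ungapped sequence in the given frame (0, 1 or 2)."""
    seq = seq.upper().replace("-", "")
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


def open_frames(seq):
    """Return the frames (0, 1, 2) that contain no stop codon."""
    return [f for f in (0, 1, 2) if not stop_codons_in_frame(seq, f)]


# Toy example (made-up sequence, not a BOLD record).
example = "ATGGCATAAGGC"
print(stop_codons_in_frame(example, 0))
print(open_frames(example))
```

A frame with no stop codons is only a necessary condition: as noted above, a nuclear pseudogene can also lack frame shifts and stop codons.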

Some of those BINs are only identified to order, but some are identified to family. Those that are ID'd to family all fall in family Phytoseiidae, which was previously noted by Young and Hebert (2015) as being haplodiploid and having a high rate of molecular evolution. They performed a study of COI protein evolution in Arachnida more broadly. The supplementary file 1 of their paper contains an alignment but does not possess any sequences with this property that we find here.

So, this is definitely interesting, and I think it's possible that these sequences are correct.

So, what we need to decide in the context of this project is how to treat these BINs. I am going to post this long comment before I accidentally hit delete or something. Then, I will consider this issue further in terms of a recommended action for discussion.

Cheers, Sally

sadamowi commented 7 years ago

Hi again Matt and Jacqueline,

For our records, here is the list of BINs in the preliminary alignment of the order Mesostigmata that exhibited the above issue:

BIN list: BOLD:ACL5723 BOLD:ACI3213 BOLD:ACL2685 BOLD:ACL5764 BOLD:ACP8004

Four of these BINs are quite close together, and so the divergent sequence filter will likely not remove this issue. All five of these BINs are from Canada only (and not the Arctic), and so these are unlikely to find close tropical pairs, given the current dataset. So, they will likely not contribute to our study but are causing problems. While an amino-acid based alignment can deal with these, our present alignment isn't very good. I think the gaps are placed in the wrong places with these sequences in. Implementing an amino-acid based alignment would require further steps, including putting all sequences into reading frame, translating all sequences, checking for indels and stop codons, and identifying problematic sequences.

Therefore, I am going to recommend deleting these BINs right after the data download step.
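A minimal sketch of what such an exclusion step could look like, written in Python for illustration only (the pipeline itself is in R). The record layout, the `bin_uri` field name, and the function name are assumptions; the BIN identifiers are the five listed above.

```python
# BINs listed in this thread for manual exclusion from the Mesostigmata alignment.
EXCLUDED_BINS = {
    "BOLD:ACL5723",
    "BOLD:ACI3213",
    "BOLD:ACL2685",
    "BOLD:ACL5764",
    "BOLD:ACP8004",
}


def drop_excluded_bins(records, excluded=EXCLUDED_BINS, bin_field="bin_uri"):
    """Split downloaded records into (kept, removed) by BIN identifier.

    `records` is assumed to be a list of dicts, one per downloaded sequence;
    `bin_field` is a hypothetical name for the column holding the BIN.
    """
    kept = [r for r in records if r.get(bin_field) not in excluded]
    removed = [r for r in records if r.get(bin_field) in excluded]
    return kept, removed


# Toy records; "BOLD:XXX0001" is a made-up placeholder BIN.
records = [
    {"recordID": 1, "bin_uri": "BOLD:ACL5723"},
    {"recordID": 2, "bin_uri": "BOLD:XXX0001"},
]
kept, removed = drop_excluded_bins(records)
print(len(kept), len(removed))
```

Returning the removed records as well makes it easy to keep a record of exactly what was manually excluded.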

Any thoughts on this? Shall we give that a go and see how the alignments look?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, no problem, I'll remove these bins when I run through Mesostigmata.

Thanks, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much. I hope that group is all good now.

I suggest also deleting the strange fish BIN noted by Jacqueline for Chordata (unless you prefer to see first if it's caught by the divergent sequence filter -- that one might get caught). I believe Jacqueline ran a sequence subset, and so additional problematic sequences could crop up, but hopefully not!

Please go ahead and close this issue once you get the BIN list for exclusion inputted. As we have discussed, we will check all outputted alignments a final time to see if any other alignment issues crop up.

Best wishes, Sally

jmay29 commented 7 years ago

I'm going to be running a larger dataset of fish so I will let you know if I come across any more weird BINs!

m-orton commented 7 years ago

So I managed to get the alignment looking better in mites while running it on the server. In addition to eliminating the bins mentioned above, I also had to eliminate ACL8004, ACL5764, ACZ0570, ACZ0583, AAW0366, ABV2830. The first 4 bins mentioned had the issue of a large insertion (over 20 bp) while the last two had an insertion of 3 bp. The results for this group can be found on dropbox.

Best Regards, Matt

m-orton commented 7 years ago

Sorry meant to write ACP8004, not ACL8004.

sadamowi commented 7 years ago

Hi Jacqueline,

This message is in response to your comment (2 comments up from the end of this chain). For running new datasets through the pipeline, I suggest either to use the same reference sequences as I have designated or to use similarly stringent criteria for designating and vetting reference sequences (as you may need more ref seqs, depending upon which groups you are running).

In the Excel file in the "References sequences" folder on dropbox, in tab 2 you can see the criteria I used for designating reference sequences. In tab 3, you will find the designated reference seqs themselves.

UPDATE: We discussed at our Skype meeting that results from your larger fish dataset are not urgently needed for this project, as Matt will be running the entirety of Chordata on the server. I suggest that you may find the results of that run helpful, i.e. for comparing the results from the sister pipeline to the results for latitude from your phylo pipeline. Still, I do suggest to be careful about reference sequence selection for your phylo pipeline as well.

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

Thank you very much for this update, and I'm glad to hear that this group is running successfully now. I have looked at the final, trimmed alignment, and this looks very good to me.

I agree about deleting BINs with such huge indels. We are looking for a subtle effect on nucleotide substitution rates. (We are not investigating indels here, although that would be interesting for future study!)

Given that we now have these "weird BINS" spread across various different threads in the Issues tracker, I'd like to suggest to create a file in the Results folder that collates the list of BINs you have excluded. This way, we can very readily incorporate that information into the Methods section. (i.e. This would be separate from the list automatically caught by the divergent sequence exclusion tool based upon sequence similarity; I don't think that we need to report the list that is excluded by the pipeline ... only the list manually excluded.) Thank you.

I'm hoping that not too many more of these cases crop up where manual sequence deletion is needed. Fingers crossed!

As a general rule, I am a little concerned about the idea of deleting a sequence with just a 3 bp gap, as in some taxa that would occur in many species, and we would throw out too much good data. Single amino acid indels are quite common in some taxonomic groups. In the case of Mollusca, there are some gaps, but I think our solution of using complete deletion was a good solution to that. So, I suggest we consider that option as well if more cases crop up of taxa that are alignable but that have certain sections that are gapped, with uncertain placement of gaps. We could do that for any other groups like the molluscs, where amino acid indels are very common.

Best wishes, Sally

m-orton commented 7 years ago

Yeah I thought the bins with the 3 bp deletion looked like indels possibly. I like your idea of using dropbox to document these weird bins we keep finding. I'll make a file in results for them. Fingers crossed there aren't too many more!

sadamowi commented 7 years ago

Hi Matt,

Thank you for documenting the "weird" sequences deleted as you go through the groups. I suggest jotting down a (very brief) note about the reason, and then we should articulate a "rule" for the Methods section prose. A rule might be something like deleting sequences with an indel of 20+ nucleotides. That's just a suggestion. We can base that rule on our observations of the alignments, i.e. on which sequences need to be deleted because they mess up the alignment.
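To make the suggested "20+ nucleotides" rule concrete, here is a hedged Python sketch (illustrative only, not the pipeline's R code). It assumes a trimmed alignment held as a dict of equal-length strings, and it treats a column as an "insertion column" when at most 10% of sequences have a base there; that occupancy cut-off, the threshold default, and all function names are assumptions.

```python
GAP = "-"


def insertion_columns(alignment, max_occupancy=0.1):
    """Mark columns where at most `max_occupancy` of sequences have a base."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for i in range(length):
        occupied = sum(1 for s in seqs if s[i] != GAP)
        flags.append(occupied / len(seqs) <= max_occupancy)
    return flags


def longest_insertion(seq, ins_cols):
    """Count the most bases a sequence places in one contiguous stretch
    of insertion columns."""
    best = run = 0
    for ch, is_ins in zip(seq, ins_cols):
        if not is_ins:
            run = 0  # stretch of insertion columns ended
        elif ch != GAP:
            run += 1
            best = max(best, run)
    return best


def flag_large_insertions(alignment, min_len=20, max_occupancy=0.1):
    """Return {name: insertion length} for sequences at or above `min_len`."""
    ins_cols = insertion_columns(alignment, max_occupancy)
    lengths = {n: longest_insertion(s, ins_cols) for n, s in alignment.items()}
    return {n: k for n, k in lengths.items() if k >= min_len}
```

Under such a rule a 3 bp insertion would be kept and only the very long insertions would be flagged. Untrimmed alignments with ragged ends would need extra handling, because sparsely covered end columns would also look like insertion columns.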

For Mesostigmata, was that 3 bp indel associated with messing up the rest of the alignment? If not, we might consider putting those sequences back in, in order to apply a consistent rule. We might return to that group after setting the "rule" based upon our observations from the next few groups.

Thoughts?

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, it was two sequences from mites that had the indel, from BINs AAW0366 and ABV2830. I will post the alignment that includes these two sequences to dropbox. File is called alignmentFinalTrimMesostigmataIndel

These sequences have a 3 bp insertion (AAG) that caused a gap in all other sequences. But as you mentioned, this is most likely a legitimate indel so I could add these bins back depending on what you think about the alignment.

I like the idea of having a standard rule to go on for deleting these sequences based on observations from the other groups.

Best Regards, Matt

m-orton commented 7 years ago

Also, what was the weird sequence from fishes that needs to be omitted again? Just compiling a list of weird sequences for dropbox.

Thanks Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for compiling this list. Indeed, this emerged as an issue in multiple threads, and so it is good to collate all this information together.

For fish, the BIN to be omitted is: BOLD:ADC1808

Reason: The two sequences in this BIN have multiple insertions compared to all other sequences investigated by Jacqueline in its class. One of the insertions is huge (way over 20 bp). Also, there are some matches between parts of this sequence and some plant COI sequences. So, this is a problematic sequence and should be deleted from analysis.

Also, I looked at your file "alignmentFinalTrimMesostigmataIndel". Those two sequences with this 3-bp insertion are not a problem for the alignment or generation of results. Those could be put back in.

However, I did notice a small issue that is suboptimal elsewhere. In the same location, some specimens have a 6-bp deletion, and I think the gap is not quite ideally placed in all of these cases. I think that using pairwise deletion will largely rectify this issue for the distance calculations for most pairs of sequences in the alignment. However, there would be a few pairs of sequences in this alignment that are slightly suboptimally aligned. Based on the position of the gaps, this would not be rectified by using complete deletion of gapped/missing positions, as we decided for Mollusca.
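For readers unfamiliar with the two gap-handling options mentioned here, the following Python sketch (illustrative only; the pipeline's actual distance calculations are not shown in this thread) contrasts pairwise deletion with complete deletion using a simple p-distance. The function names and the toy alignment are assumptions.

```python
MISSING = {"-", "N", "?"}


def p_distance(a, b):
    """Proportion of differing sites, using pairwise deletion: a site is
    skipped only if one of these two sequences is gapped or missing there."""
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in MISSING or y in MISSING:
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def complete_deletion(alignment):
    """Drop every column in which any sequence is gapped or missing."""
    seqs = {n: s.upper() for n, s in alignment.items()}
    length = len(next(iter(seqs.values())))
    keep = [i for i in range(length)
            if all(s[i] not in MISSING for s in seqs.values())]
    return {n: "".join(s[i] for i in keep) for n, s in seqs.items()}


# Toy alignment: sequence "c" has a 2 bp gap where "a" and "b" differ.
aln = {"a": "ACGTAC", "b": "ACGAAC", "c": "AC--AC"}
print(p_distance(aln["a"], aln["b"]))        # pairwise deletion
trimmed = complete_deletion(aln)
print(p_distance(trimmed["a"], trimmed["b"]))  # after complete deletion
```

Neither option moves a gap, which fits the point above: if a gap is placed in the wrong columns for a given pair, the columns compared for that pair can still be misaligned.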

So, if you are willing to run this group again, with those two BINS back in, I will take a more careful look at the final alignment, focusing on pairs of BINs that were actually paired up to make sure those pairwise alignments are fine.

For V2 of the pipeline, I will explore options for doing an amino-acid-based alignment. Even if we have rigorously investigated the alignments manually, I think that for future uses of the pipeline (such as investigating new traits or a larger database of input sequences), it would be helpful to be able to do such an alignment automatically. I'll make a note in that thread.

Best wishes, Sally

m-orton commented 7 years ago

Thanks Sally,

I will reinsert those bins back into the alignment and run through mites again on the server today. Also, started running through Mollusca so hopefully I can have the Mollusca results up today as well.

Best Regards, Matt

sadamowi commented 7 years ago

Thank you very much Matt. I look forward to seeing the results.

sadamowi commented 7 years ago

PS. I suggest to send me a note whenever you have new results. I will try to look at results within a day, whenever possible, to verify the alignments and to help us to formalize the "rules" about the kinds of sequences we are omitting. Thank you.

m-orton commented 7 years ago

Hi Sally, I'm leaving a note that I'm now uploading the new Mesostigmata results. Also, in case you didn't see, the results for Cnidaria, Branchiopoda and Annelida are also up from the 30th. Mollusca to follow soon.

sadamowi commented 7 years ago

OK great! Thanks Matt. I will have a look at this. I suggest that I add "SJA" to the folder name to indicate any analyses that I have checked. This will help us to keep track. I will check the final, trimmed alignments, unless there are any other especially relevant files (such as a separate file with sequences with a lot of indels). Let me know if my proposed adjustment to the folder titles interferes with any of your informatics processes. For the results Excel files, I would find the relative distances data frame helpful, if that's not too much trouble. Thanks again.

Cheers, Sally

m-orton commented 7 years ago

Ah ok no problem, I will include the relative distance dataframe as well from now on. Adjusting the file names is a good idea to help keep track of which taxa have been checked.

I think I might just make a separate thread for new results posts. If there is a specific problem with a taxon then I will make a thread specific to that taxon.

Best regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you for putting the sequences with the small insertion back in. I have checked the final alignment from the Feb 1 server run, and I think it looks good. Closing issue.

Best wishes, Sally