m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
7 stars 1 forks source link

New Mollusca Results #31

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Hi Sally,

Just letting you know I am currently uploading the new Mollusca results to the shared folder in dropobox. The results are from using the alignment parameters you suggested (maxiters = 3, diags = TRUE, gapopen =-3000) and pairwise deletion set to FALSE for the final pairwise distance matrix. The alignments look good overall though there are still a few gaps I'm noticing in Bivalvia and Gastropoda. Though I think its still an improvement from the second runthrough.

The number of pairings has increased slightly with this runthrough however I am getting more pseudoreplicates than before.

Best Regards, Matt

m-orton commented 7 years ago

Closing this issue since Mollusca will be runthrough again a final time on the server.

sadamowi commented 7 years ago

OK - thanks Matt. I was still intending to look at those alignments, to make sure that our selected settings are working OK. I will have a look and let you know if I think anything should be changed before the final run. I am having a look at those now.

Cheers,

Sally

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml


From: Matthew Orton notifications@github.com Sent: Monday, January 23, 2017 11:26:42 AM To: m-orton/R-Scripts Cc: Subscribed Subject: Re: [m-orton/R-Scripts] New Mollusca Results (#31)

Closing this issue since Mollusca will be runthrough again a final time on the server.

- You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/m-orton/R-Scripts/issues/31#issuecomment-274536615, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AV89OlaUj4mqdJeZtvIuhItNmpUMDTR_ks5rVNTCgaJpZM4LVG0j.

sadamowi commented 7 years ago

Hi again Matt,

Sorry for the delay in responding to this issue. I am reopening this issue to discuss something.

I am looking at the alignment called "alignmentFinalTrimBivalviaNew" within the folder called:

Mollusca - run 3 - new alignment params - pairwise deletion = FALSE

I also notice the gaps, as you mentioned. However, we changed the parameters to address this in a satisfactory way, with pairwise deletion = FALSE. With the stringent gap opening penalties, I think the alignment makes sense and isn't excessively gappy (i.e. gaps are grouped together, as would be expected as amino acid indels occur in groups of three nucleotides). There are actual amino acid indels. The specific placement of the gaps is in question, but with pairwise deletion set to FALSE, the analysis is resistant to the exact placement of those gaps.

However, the problem here is that a sequence has gotten through that is quite short. With pairwise deletion set to false, we will be omitting too large a portion of the sequence. Those missing sites will be completely deleted.

The short sequence is ACH2910.

I thought we had discussed a solution to this issue, with there being another count of nucleotides following the trimming of the sequences according to the reference sequences. This step would allow sequences to be caught that got through the first length filter (due to long total length) but that didn't have enough length within our target region. (This could happen if a specimen was analyzed with different primers, for example.)

Was this most recent Mollusca run that I found in the dropbox done before that change to the code?

Thanks very much for verifying that everything is now set to address such cases for the final run.

Best wishes, Sally

sadamowi commented 7 years ago

Hi again Matt,

I am glad I am having another look, as I think it remains important that we continue to check the alignments until the end of obtaining the final results. I'll do a final check of the alignments as well after you run through the groups on the server.

I am now looking at this file within the latest Mollusca run: alignmentFinalTrimCephalopodaNew

I have discovered a very odd situation. I think that BIN ACH5013 contains serious errors. The beginning and middle of the sequence are fine, but I think the end portion is wrong and contains editing errors. It doesn't match closely to anything. This problem could have occurred if the reverse primer picked up a contaminant and then the forward and reverse were crammed together into an incorrect consensus sequence. This sequence is in BOLD but was mined from GenBank. There are no trace files.

This is a very odd situation. In prior cases that we detected of errors, the new section of code you created to identify and eject "highly divergent" sequences picked those up. In this case, this was not picked up, as much of the sequence was correct. I don't think we want to set a more strict divergent sequence threshold, as we want sequences up to 0.15 divergent from other for our analysis.

I think one could create a more sophisticated tool to deal with this (such as BLASTing the early, middle, and end portions of sequences). However, I think that is beyond our present scope. Another option that we could note for a future improvement to the script (perhaps by someone else) is a translation step and then the divergent sequence check can be repeated on the amino acid alignment.

What I'd like to suggest is that we insert a comment that we inspected the "finaltrim" alignments of a penultimate run of each taxon and that we manually excluded additional problematic sequences. So, for the final run through, I'd like to suggest to exclude that BIN (ACH5013) after the initial download step and before doing anything else.

For similar reasons, I also suggest excluding BIN BOLD:ACH6643 (this BIN is also in Cephalopoda within the phylum Mollusca).

What do you think? I'd most certainly welcome any other suggestions to deal with this. Hopefully we can come up with an easy solution, given we wish to wrap up this phase of the project.

This is a very odd case, but such cases will crop up. Mistakes can be overlooked, and we are dealing with a large public database. In this case, this is a GenBank sequence but from an unpublished study. I think that the sequence did not go through careful scrutiny before submission to GenBank. My best guess is that that's a chimera of a good sequence combined with a contaminant or some other part of the genome amplified due to non-specific binding of the reverse primer.

Best wishes, Sally

sadamowi commented 7 years ago

Hi again Matt,

I have finished looking at the Mollusca alignments from the last run. I was happy with the alignments for Polyplacophora and Caudofoveata.

For Gastropoda, the overall alignment looked good. There was the same issue as for the Bivalvia alignment, where there were a few short sequences that would result in a lot of sequence positions being thrown out, with pairwise deletion set to false.

Perhaps this is already addressed, and I am misremembering as to whether the latest Mollusca run was performed before or after the second sequence length check.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally,

Thanks for looking into the Mollusca results, my mistake on closing the issue, i'll check before I close any other issues.

There is a count of nucleotides in the initial filtering step (section 2) where I filter based on sequence length between 640 bp and 1000 bp counting only nucleotides (not gap characters). However I dont have code that filters according to sequence length after trimming with the reference. I apologize, I mistunderstood when we talked about this before as I thought you were referring to changing the initial sequence length filter.

However I can add additional code in the reference sequence trimming section that does this sequence length filtering. Can you let me know what you would want the sequence length min and max to be for this?

I will then use this code to eliminate the shorter sequences you mentioned in Bivalvia. In the final runthrough of Mollusca I will also manually exclude the bins ACH6643 and ACH5013 in the initial filtering step of the pipeline as well.

Does this sound good before proceeding with Mollusca?

Best Regards, Matt

sadamowi commented 7 years ago

Dear Matt,

No worries. It would be easier to keep track of things if I replied faster. I will try to stay on a roll here!

Yes, I think that initial sequence length filter is a good idea. I agree with you about adding a second length-based filter after trimming according to the reference sequence. I think that is particularly important for Mollusca (given our change to the pairwise deletion option).

(I think that makes sense in other groups as well, but we could just make that change in the Mollusca branch for now and see how that goes. It would make less of a difference in the other branches, and I believe leps are complete now?)

In terms of length, what about 630? We could also go with 640 again. Any missing data in a sequence will result in the length of the entire alignment being trimmed. So, maybe 640 is best. What do you think?

That sounds good about filtering out those two problematic BINs. If it's readily feasible, I suggest to add the deletion of those BINs to the code itself, so that this is clear and also so that we don't forget to do that if we need to run the code again sometime.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally, thanks for the response, I think I'll just make the filtering change to Mollusca for now since the other classes didnt seem to have this problem. Leps is complete now and since the alignments were good I dont think we would have to worry about implementing this filtering change there.

I think 640 bp would be good and I'll make sure to add a comment in the Mollusca branch specifying the BINs we are removing.

Best Regards, Matt

sadamowi commented 7 years ago

OK sounds good - thank you!

-- Sarah (Sally) J. Adamowicz, Ph.D. Associate Professor Biodiversity Institute of Ontario & Department of Integrative Biology University of Guelph 50 Stone Road East Guelph, Ontario N1G 2W1 Canada

Email: sadamowi@uoguelph.ca Phone: +1 519 824-4120 ext. 53055 Fax: +1 519 824-5703 Office: Centre for Biodiversity Genomics 113 http://www.dnabarcoding.ca/ http://www.barcodinglife.org/ http://www.uoguelph.ca/ib/people/faculty/adamowicz.shtml


From: Matthew Orton notifications@github.com Sent: Monday, January 23, 2017 11:49:43 PM To: m-orton/R-Scripts Cc: Sarah Adamowicz; State change Subject: Re: [m-orton/R-Scripts] New Mollusca Results (#31)

Hi Sally, thanks for the response, I think I'll just make the filtering change to Mollusca for now since the other classes didnt seem to have this problem. Leps is complete now and since the alignments were good I dont think we would have to worry about implementing this filtering change there.

I think 640 bp would be good and I'll make sure to add a comment in the Mollusca branch specifying the BINs we are removing.

Best Regards, Matt

- You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHubhttps://github.com/m-orton/R-Scripts/issues/31#issuecomment-274709141, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AV89OgxVhMA2Zz4k0GateFijvhjCkNqtks5rVYLngaJpZM4LVG0j.

sadamowi commented 7 years ago

Sorry Matt. I should revise the cutoff issue. I wasn't thinking clearly about that. I've reviewed the current approach: filtering for at least 640 total length and then trimming to ~620, so that the good sequence reads through the ends of where we are trimming. So, if we went with 640, we would lose everything.

For the post-trimming cut-off, we would need 620 bp or less. I'd like to recommend somewhat less, as there are some biologically real amino acid indels in molluscs. We wouldn't want good sequences being thrown out that have a single amino acid deletion, for example. I think even 600 bp could be good, upon looking at the data. We wanted to get out a few cases of sequences missing a lot of data. What do you think?

Also, I am wondering if something about the trimming isn't quite doing what we thought?

If we are trimming the reference sequence by 19 bp on both ends, then I would expect the alignments to be 620 bp if there are no gaps (for reference sequences that are 658 bp long; a few mollusc refs are longer). I would therefore expect the alignments to be more than 620 bp if there are gaps. I.e. the reference sequence would be 620 in nucleotides plus whatever gaps are inserted. I notice that some of the alignments are 620 or 623 despite having quite a few gaps. I was wondering if you might please check that. I would expect this to be of modest impact, but this relates to reproducibility and having what the analyses do match our description of what we are doing. On the other hand, if the analysis is analyzing 620 alignment positions, then we can rephrase that. I think that, however, analyzing the same biological fragment makes more sense. Thanks very much for letting me know if you have any thoughts on that.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

I think 600 bp would be a good cut off for min length. Your right that using 640 bp would result in not having any sequences after trimming, I just realized that lol. I see what you mean about the trimming, I'll have to look into this further and get back to you.

Best Regards, Matt

sadamowi commented 7 years ago

OK sounds great Matt! Let me know once you've re-run Mollusca on the server, and I'll have a final look at the results.

m-orton commented 7 years ago

Hi Sally, just posted the full result set for Mollusca to dropbox.

Also see the weirdBins folder for the bins I had to remove and the reason. You will notice in dropbox there are also final alignments for Bivalvia and Gastropoda with end suffix "gap" to the name, these were what the alignments looked life before removal of any additional bins so you can look at those as well.

I think the alignments are looking much better now for both Gastropoda and Bivalvia. Gastropoda had a couple of bin sequences with 19 bp insertions that I had removed and this seemed to improve the alignment greatly. I figured these insertions were large enough to justify removal.

Bivalvia had several bin sequences with smaller 10 bp insertions that I removed however if you think these are biologically significant we can add these back in. The Bivalvia alignment does seem to have very small gaps in some regions however these appear to be real indels. (but I leave that for you to decide) There appear to be instances of 3 bp indels in some of the sequences and instances of 6 bp indels near the end portion of the alignment.

Let me know what you think,

Best Regards, Matt

sadamowi commented 7 years ago

Hi Matt,

Thank you very much. I have looked over this most recent server run, and I agree with your decisions on sequence inclusion/exclusion. Closing this issue!

Best wishes, Sally