m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large scale phylogenetics comparisons of various taxa found on the Barcode of Life Database (BOLD) API.
GNU General Public License v3.0
7 stars 1 forks source link

Discussion - Running Cnidaria #27

Closed sadamowi closed 7 years ago

sadamowi commented 7 years ago

Hi Matt,

Sorry for my delay in posting. I have been working on the project a fair bit recently but have also made some errors, such as forgetting to save the Workspace. I needed to re-run some things in order to be able to compare properly the different trimming lengths as well as alignment settings. In any case, I have now moved on to Cnidaria.

I have been encountering a few errors. I have been running the Annelida code from Github (most recently obtained today) but with the Cnidaria reference sequences inserted. I've also been using maxiters = 3 and diags = TRUE for all alignments. The alignments mostly look very good. I will try a different run with a different setting to see if it's possible to resolve one case, but I think overall the alignments look good and that close relatives are properly aligned with one another, which is the most important thing here.

However, I am getting some errors in some other parts of the code.

The first one is in lline 625 in the code (note these line numbers are after I made adjustments for Cnidaria, such as inputting the Cnidaria ref seqs. Line numbers would be slightly different compared to the Annelida branch, but within a couple of lines of the Github version.)

dnaStringSet4 <- foreach(i=1:nrow(dfRefSeq)) %do% subset(dnaStringSet4[[i]][-refSeqRemove[[i]]])

Error message:

Error in subset(dnaStringSet4[[i]][-refSeqRemove[[i]]]) : task 4 failed - "subscript is a logical vector with out-of-bounds TRUE values"

Same message for similar code a couple of lines later.

Also, error at line 711.

dfPairingResultsL1L2$inGroupPairing <- rep(1:(nrow(dfPairingResultsL1L2)/2), each = 2)

error message:

Error in $<-.data.frame(*tmp*, "inGroupPairing", value = c(1L, 1L, : replacement has 2320 rows, data has 2321

Also error after 716

dfPairingResultsL1L2 <- (dfPairingResultsL1L2[,c("inGroupPairing","record_id","bin_uri","values", "inGroupDistx1.3","medianLatAbs","medianLatMap","latMin","latMax", "binSize","phylum_taxID","phylum_name","class_taxID", "class_name","order_taxID","order_name","family_taxID", "family_name","subfamily_taxID","subfamily_name", "genus_taxID","genus_name","species_taxID","species_name", "nucleotides","ind","medianLon")])

Error message:

Error in [.data.frame(dfPairingResultsL1L2, , c("inGroupPairing", "record_id", : undefined columns selected

and, further error messages are beyond these also.

I have thought of something to try (deleting those reference sequences associated with classes that didn't end up yielding data). I will keep you posted as to whether that works.

Cheers, Sally

sadamowi commented 7 years ago

Hi Matt,

I am happy to report that this solution did work. I removed the reference sequences for which sequence data were not yielded in the outputted alignments. In one class, there was no outputted alignment at all, probably because there were no data passing initial filters. In a second class, there were sequences in the interim alignments, but by the final, trimmed file, there was only the reference sequence left. Deleting only the former case did not solve the problem. Deleting reference sequences for both of these classes did solve the problem. There were no errors (up to the plotting section - I haven't implemented your suggestions yet on that component), and I got reasonable results.

(By the way, your solution for the markercode worked well. I looked at the raw API download and noticed that there were a variety of markers present for Cnidaria, but your filter worked well to keep just COI-5P for analysis.)

Relating to this problem mentioned above, there could be two approaches to this so that future users aren't confused:

  1. just add a comment about this problem and indicate that the references not yielding sequences in the final trimmed sequence files should be deleted and then the code rerun.

  2. adjust the code to be more tolerant of this scenario.

I am fine with either option, especially considering we are focusing upon completing our own paper in a timely fashion for now.

Thanks for your thoughts.

Cheers, Sally

m-orton commented 7 years ago

Hi Sally,

Sorry about the errors with the Cnidaria alignment. I was able to address this issue by adding a few lines of code from lines 629-638. This should check dnaStringSet4 for any alignments that only have a reference sequence and remove them. I ran through Cnidaria myself to check and the code works great now. I updated the Annelida, Mollusca and Cnidaria branches with these changes.

The Cnidaria branch should now be up to date with the other branches and should not be giving errors besides the plot error.

The code I added 629-628 in Cnidaria branch (may be slightly different in other branches):

`#We also have to check dnaStringSet4 for alignments that only contain a reference sequence and remove these. alignmentCheck <- foreach(i=1:nrow(dfRefSeq)) %do% which(length(dnaStringSet4[[i]]) == 1) alignmentCheck <- which(alignmentCheck>0)

Subset dnaStringSet4, taxalistcomplete, alignmentFinalTrim and refSeqRemove by the alignment check

if(length(alignmentCheck>0)){ dnaStringSet4 <- dnaStringSet4[-alignmentCheck] taxaListComplete <- taxaListComplete[-alignmentCheck] alignmentFinalTrim <- alignmentFinalTrim[-alignmentCheck] refSeqRemove <- refSeqRemove[-alignmentCheck] } `

m-orton commented 7 years ago

Sorry the formatting doesnt seem to work:

We also have to check dnaStringSet4 for alignments that only contain a reference sequence and remove these.

alignmentCheck <- foreach(i=1:nrow(dfRefSeq)) %do% which(length(dnaStringSet4[[i]]) == 1) alignmentCheck <- which(alignmentCheck>0)

Subset dnaStringSet4, taxalistcomplete, alignmentFinalTrim and refSeqRemove by the alignment check

if(length(alignmentCheck>0)){ dnaStringSet4 <- dnaStringSet4[-alignmentCheck] taxaListComplete <- taxaListComplete[-alignmentCheck] alignmentFinalTrim <- alignmentFinalTrim[-alignmentCheck] refSeqRemove <- refSeqRemove[-alignmentCheck] }

m-orton commented 7 years ago

Meant to say lines 629-638 for changes.

sadamowi commented 7 years ago

Hi Matt,

First off, no apology needed! I appreciate your hard work and collaboration on this project. We thought that small glitches could emerge as we run the different taxa, as there are slight differences in the datasets (e.g. coverage of sequences, markers present, etc.).

Thank you very much for working on this and finding a solution.

I consider the Cnidaria analysis nearly complete. I am going to try one more setting to see if that addresses a slight hiccup specifically in the Hydrozoa alignment. However, I think overall everything is running very well.

Cheers, Sally

sadamowi commented 7 years ago

I believe this issue can now be closed, as I will make notes on the alignment settings in the alignment discussion thread. I will let you know if I encounter any further issues as I run the updated code from Github.

Thank you again!