m-orton / Evolutionary-Rates-Analysis-Pipeline

The purpose of this repository is to develop software pipelines in R that can perform large-scale phylogenetic comparisons of various taxa found on the Barcode of Life Database (BOLD) via its API.
GNU General Public License v3.0

Ideas to consider for V2 #41

Closed m-orton closed 7 years ago

m-orton commented 7 years ago

Making a separate thread for ideas and code sections that could be implemented in version 2 (V2) of the pipeline.

sadamowi commented 7 years ago

Ideas for V2:

-Explore options for doing more automated screening for problematic sequences. Jacqueline is working on such code now, specifically as relating to detecting sequences with many indels.

-Also, explore options for detecting species with highly unusual amino acid sequences. This can be indicative of a major contamination event, such that the wrong organism was sequenced from a very different taxonomic group. This can also occur for artefactual reasons rather than biological, such as when a nucleotide is dropped erroneously during sequencing or sequence editing, resulting in a frame shift. When such frame shifts happen near the end, there may not be enough sequence left for a stop codon to crop up. Stop codons block a sequence from being assigned a BIN, but unusual amino acid sequences do not. To address this, explore integrating tools already available in the VLF package.
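Before wiring in VLF, a rough first pass at catching the frame-shift cases described above could simply translate each sequence and look for a premature stop codon. A minimal sketch (not from the pipeline; the function name is hypothetical), assuming sequences are already in reading frame 1 and use the invertebrate mitochondrial code, in which TAA and TAG are the stop codons:

```r
# Flag sequences whose translation would contain a premature stop codon.
hasInternalStop <- function(nucleotides) {
  sapply(nucleotides, function(s) {
    s <- gsub("-", "", toupper(s))              # strip alignment gaps
    if (nchar(s) < 6) return(FALSE)
    starts <- seq(1, nchar(s) - 2, by = 3)
    codons <- substring(s, starts, starts + 2)
    codons <- codons[-length(codons)]           # ignore a legitimate terminal stop
    any(codons %in% c("TAA", "TAG"))            # internal stop = likely artefact
  }, USE.NAMES = FALSE)
}
```

As noted above, this would not catch frame shifts near the end of a sequence where no stop codon crops up, which is where amino-acid-level screening (e.g. via VLF) would still be needed.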

-Explore options for performing amino-acid-based alignments. This can improve the gap placement. Additionally, having sequences in the correct reading frame opens new analysis options for looking at variability at first, second, and third codon positions and for looking at ratios of non-synonymous-to-synonymous substitutions. These options permit exploration of underlying mechanisms behind differences (e.g. differences in mutation rate versus selection regime).
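As a rough illustration of what in-frame sequences would enable, variability at first, second, and third codon positions could be summarized along these lines (a sketch only; assumes aligned sequences starting at reading frame 1, and the function name is hypothetical):

```r
# Proportion of variable alignment columns at each codon position.
codonPositionVariability <- function(seqs) {
  mat <- do.call(rbind, strsplit(toupper(seqs), ""))
  pos <- ((seq_len(ncol(mat)) - 1) %% 3) + 1     # 1, 2, 3, 1, 2, 3, ...
  variable <- apply(mat, 2, function(col) {
    col <- col[col %in% c("A", "C", "G", "T")]   # ignore gaps/ambiguities
    length(unique(col)) > 1
  })
  tapply(variable, pos, mean)                    # proportion variable per position
}
```

Third positions would be expected to show markedly higher variability than second positions under typical COI evolution, which is one way such output could be sanity-checked.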

sadamowi commented 7 years ago

For the final version, I suggest that we consider whether we should use a specified number of nucleotides from the reference sequence in the final, trimmed alignment, rather than a specified number of total alignment positions. Currently, the alignments are 620 bp long, regardless of gaps. This edit is expected to have only minor impact and is something to consider for the future.
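One way to implement this would be to locate the alignment column at which the reference sequence reaches its Nth non-gap nucleotide, then trim all sequences at that column. A sketch under those assumptions (refAligned and the function name are hypothetical):

```r
# Find the alignment column at which the aligned reference sequence
# (with "-" gaps) reaches its nth actual nucleotide.
nthRefColumn <- function(refAligned, n) {
  chars <- strsplit(refAligned, "")[[1]]
  which(cumsum(chars != "-") == n)[1]   # first column where the count hits n
}

# e.g. trim all aligned sequences to the span covering 620 reference bp:
# trimmed <- substr(alignedSeqs, 1, nthRefColumn(refAligned, 620))
```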

sadamowi commented 7 years ago

DRAFT code from Jacqueline's gappy sequence detector tool, copied from a closed thread, for consideration for V2.

```r
# This will give the number of positions where gap characters are found
# for each sequence.
internalGaps <- sapply(regmatches(dfInitial$nucleotides,
                                  gregexpr("[-+]", dfInitial$nucleotides)), length)

# Add it to the dataframe so you can get a sense of which sequences are weird.
dfInitial$numberOfGaps <- internalGaps

# Mean (or median) number of gaps.
meanGap <- mean(internalGaps)

# Upper range of gappiness.
extremeHighGap <- meanGap + 7

# Lower range of gappiness.
extremeLowGap <- meanGap - 7

# Identify any sequences whose number of gaps deviates greatly from the mean,
# i.e. greater than extremeHighGap or lower than extremeLowGap.
extremeBins <- which(internalGaps > extremeHighGap | internalGaps < extremeLowGap)

# Subset out these extreme gappy/ungappy sequences so you can look at them.
dfExtreme <- dfInitial[extremeBins, ]

# If you want to remove them from the original dataframe.
# (Note: if extremeBins is empty, negative indexing would drop all rows.)
dfNormal <- dfInitial[-extremeBins, ]
```

jmay29 commented 7 years ago

I will also have this gappy.R function saved to a repo on my GitHub (started it yesterday!)

jmay29 commented 7 years ago

I have been looking into other MSA options in R and found a package called DECIPHER. I'm going to experiment with it to see if it improves memory usage as that is the main struggle I've been having!

https://www.bioconductor.org/packages/devel/bioc/vignettes/DECIPHER/inst/doc/ArtOfAlignmentInR.pdf

Thought you guys might be interested in this for future versions of the lat pipeline.

m-orton commented 7 years ago

Hey Jacqueline, thanks for posting this. I would be really interested to see performance/memory comparisons between MUSCLE and DECIPHER alignments.

sadamowi commented 7 years ago

For the large taxa that were sub-divided by region in order to run, I suggest it would be helpful to cross-check the northern BINs included in pairs for North America vs. Eurasia. For overlapping BINs, the results between the continents could be averaged. While overlapping BINs would likely be a small proportion of the total, the absolute number of BINs with northern Holarctic distributions could be large for the largest of the insect orders. Something to consider.

sadamowi commented 7 years ago

Hi Jacqueline,

Another thing to consider for your gap detector is gaps or insertions of 1-2 bp. (Ideally, gaps would be in multiples of threes, as real sequences would have entire amino acids as indels. Indels of 1-2 bp can be indicative of a pseudogene or, more frequently, a sequence editing error.)
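A simple check along these lines would flag any gap run whose length is not a multiple of three. A sketch (hypothetical function name; note that as written it also counts leading/trailing gap runs, which may need to be excluded for aligned sequences):

```r
# Flag sequences containing a gap run whose length is not a multiple of
# three -- a possible pseudogene or sequence editing error.
hasFrameshiftGap <- function(nucleotides) {
  sapply(nucleotides, function(s) {
    runs <- regmatches(s, gregexpr("-+", s))[[1]]   # all runs of "-"
    any(nchar(runs) %% 3 != 0)
  }, USE.NAMES = FALSE)
}
```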

For example, in the alignment file "alignmentFinalTrimLepAUS" (folder Lepidoptera - Server Run - Jan 15 - SJA), BIN AAD6260 (sequence number 1524 in the alignment) has an insertion near the end, causing a gap in the alignment. I think that likely an extra G was called in that region of the sequence, as you can see the string of Gs in that region. I think that ideally such regions would be trimmed off the sequence.

Anyhow, that is a suggestion for consideration. I think that this issue (which is very rare, given the huge size of the alignment) will not have a great impact on the results here (as there is just one nucleotide that I think is incorrectly aligned in the end, due to the extra G), and that sequence might not be in a pair, given the huge alignment size. However, I think that such a tool could be helpful for the future. For example, if we were diving deeper into variability at first, second, and third codon positions, identifying such cases could be very helpful.

Again, I suggest to check out the package VLF before delving too deeply into this issue. I believe Taryn did look at variability at the different positions. So, it would be worth considering her strategy before doing something new.

Best wishes, Sally

m-orton commented 7 years ago

Hi Sally and Jacqueline,

I was thinking about changes that could be made to V2 of the pipeline. Here are a few changes that come to mind that I wanted to implement in V1 but didn't have time for:

Performance:

-Switch over entirely to the data.table package when using dataframes, to improve the performance of all dataframe operations. This will require rewriting many of the dataframe commands to work with data.table; however, I think this could increase performance greatly.

Some quality of life changes:

-Improve presentation of pseudoreplicates (removal of brackets and quotations around them).

-Improve presentation of the plot: remove the blank space between the 1 and -1 values, and figure out how to improve the resolution of the pairing numbers on the x-axis, since they become hard to read with large classes.

-Improve presentation of the map: I was thinking of possibly figuring out how to show hover text for both lineages of a pairing simultaneously when hovering over the lines connecting lineages, so that you don't have to hover over individual points on the map to show hover text relating to a pairing.
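For the data.table switch, much of the rewrite would be syntactic. A sketch of the kind of change involved, assuming a numeric per-sequence column like numberOfGaps and the BOLD bin_uri field (both names are assumptions here, not pipeline code):

```r
library(data.table)

# Convert the existing dataframe once; subsequent operations use
# data.table's by-reference, indexed syntax.
dtInitial <- as.data.table(dfInitial)

# e.g. mean gap count per BIN, instead of aggregate()/tapply() on a dataframe:
gapSummary <- dtInitial[, .(meanGaps = mean(numberOfGaps)), by = bin_uri]

# e.g. filtering without which()/row indexing:
dtNormal <- dtInitial[numberOfGaps <= extremeHighGap & numberOfGaps >= extremeLowGap]
```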

Best Regards, Matt

m-orton commented 7 years ago

Oh, one I forgot: add Dropbox integration so that output files (fas, csv, tsv) could be exported directly to a specified directory in Dropbox.

sadamowi commented 7 years ago

Hi Matt and Jacqueline,

Your ideas towards improving performance sound very interesting. I think those would definitely be ideas worth considering for V2. We may need to rerun the analyses after peer review. Additionally, such performance improvements would be helpful for future adaptations of the code for new studies.

Best wishes, Sally

sadamowi commented 7 years ago

Explore running analysis through high-performance computing cloud (through Compute Canada).

sadamowi commented 7 years ago

For V2, check the pseudoreplicate identifier. (For V1, check outputted results.)

Example, for Perciformes, here is the list of pseudoreplicates:

("63", "1") ("67", "2") ("15", "11") ("46", "12") ("42", "17") ("46", "24") ("39", "56", "26") ("45", "29") ("33", "32")

In one case, pseudoreplicates are grouped together into a trio. In another case, pair 46 appears in two different pseudoreplicate pairs. To address this for V1, I created a pseudoreplicate trio.
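For V2, the grouping step could merge any pairs that share a member into a single group, producing such trios automatically. A sketch (not pipeline code) using the Perciformes list above:

```r
# Collapse pseudoreplicate pairs sharing a member into one group, so that
# e.g. ("46", "12") and ("46", "24") become the trio ("46", "12", "24").
pseudo <- list(c("63", "1"), c("67", "2"), c("15", "11"), c("46", "12"),
               c("42", "17"), c("46", "24"), c("39", "56", "26"),
               c("45", "29"), c("33", "32"))

groups <- list()
for (p in pseudo) {
  hit <- which(sapply(groups, function(g) any(p %in% g)))
  if (length(hit) == 0) {
    groups[[length(groups) + 1]] <- p            # no overlap: new group
  } else {
    merged <- unique(c(unlist(groups[hit]), p))  # merge all overlapping groups
    groups <- groups[-hit]
    groups[[length(groups) + 1]] <- merged
  }
}
```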

sadamowi commented 7 years ago

Code the rule to omit pairs with ingroup distance <2% higher up in the pipeline, at the point of forming preliminary pairs. This might free up some BINs to be included in more suitable pairings.

sadamowi commented 7 years ago

For V2, consider the merits of repeating the analysis (perhaps for select groups) using a 0.20 maximum distance criterion for OGs.

m-orton commented 7 years ago

This issue was moved to jmay29/lat-project#5