
Anacapa documentation - options for own data #57

Open FabianRoger opened 3 years ago

FabianRoger commented 3 years ago

Hi,

I have managed to install the Anacapa container and run the 12S example successfully.

Now I want to run my own data, but I am not sure I have found all the options that need to be customized.

I have two markers, COI and 12S.

What I did

1) I created a new folder inside Anacapa_db called COI_data

2) I modified the run-anacapa-COI.sh file accordingly

3) I changed the primers in forward_primers.txt & reverse_primers.txt and removed the primers for the barcodes that I didn't have.

4) I changed the expected length in metabarcode_loci_min_merge_length.txt

However, I seem to be missing something

Thanks!

Fabian

zjgold commented 3 years ago

Hi Fabian,

Anacapa uses forward_primers.txt, reverse_primers.txt, and metabarcode_loci_min_merge_length.txt to determine which primers to sort on. There is a primer sort function in the Anacapa toolkit that takes this information, looks for the primers, and then bins the reads into separate folders, each of which is then run separately through the dada2 and classification steps.

If you ended up generating both CO1 and 16S (not 12S, your intended target), then I assume one of those 3 files still has the 16S marker in it.
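
One quick way to check that (a minimal sketch, assuming the default Anacapa_db layout; adjust the path to wherever your copy lives) is to print every line of the three files that drive the primer sort and look for leftover marker names:

cd ~/Anacapa_db
grep -H "" forward_primers.txt reverse_primers.txt metabarcode_loci_min_merge_length.txt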

However, from your message I am not exactly sure what issue you ran into. I have a few questions to understand your data:

1) Is this one MiSeq run with both CO1 and 12S data?
2) Are all of the CO1 and 12S sequences in the same input folder?

If these were generated separately, then I would advise running Anacapa twice, once for each marker. You can include more markers than are in your dataset, but it will run slower, since it will spend a lot of computational time searching through your sequences for the markers listed in the primer and merge_length.txt files.
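
As a rough sketch of what I mean (folder names and file name patterns here are hypothetical, and this assumes each marker already has its own fastq files because the libraries were prepared separately):

# keep each marker in its own input folder, with its own output folder per run
mkdir -p COI_run/fastq 12S_run/fastq
mv raw_fastq/*COI*_R[12]_*.fastq.gz COI_run/fastq/
mv raw_fastq/*12S*_R[12]_*.fastq.gz 12S_run/fastq/
# then invoke the toolkit once per marker, each run pointed at its own input,
# output, and a primer/merge-length file set that lists only that marker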

FabianRoger commented 3 years ago

Thanks @zjgold. Yes, it is one MiSeq run, but the primers were run separately, so running Anacapa twice is probably a good option. I ran it together and, as expected, it took quite some time, but it generated the expected results (separating out by primer).

However, while my reads are paired and have a large overlap, Anacapa didn't generate nochim_merged16S.txt or nochim_mergedCOI.txt, only the corresponding forward and reverse files.

Right now I am stuck with the taxonomic assignment (for which I came here in the first place, see #56).

1) I modelled my separately run dada2 output exactly on the 12S example.
2) I replaced the files in the CO1dada2_out folder with my nochim_mergedCOI.fasta and nochim_mergedCOI.txt (and the same for 16S).
3) I downloaded the COI and 16S databases from the Google Drive, copied them into the Anacapa_db folder, and named the folders 16S and 12S respectively.

When I run the run-anacapa-classifier.sh script (after pointing it to the right data), the script runs but tells me for 16S that:

769 reads; of these:
769 (100.00%) were unpaired; of these:
769 (100.00%) aligned 0 times

For COI, the script fails with:

nochim_mergedCO1.fasta exists
end-to-end
terminate called recursively
Aborted
(ERR): bowtie2-align exited with value 134

Note that the primers I used are not the primers used in the Anacapa project, but from my understanding of how the CRUX-DB datasets for COI and 16S were generated, this shouldn't be a problem? (CRUX blasts the ecoPCR hits back onto the database and chooses the longest sequence, right? That should then cover more than only the primer region used in ecoPCR.)

zjgold commented 3 years ago

Hi @FabianRoger. Yeah, if they were run separately, I would process them separately. That way you can fine-tune the parameters for each marker and run.

Pairing: You may want to check metabarcode_loci_min_merge_length.txt; it sounds like you might have chosen too high a value, which is causing your reads not to merge. We have recommended settings for these, but you can try shortening it if needed. If you have confirmed that you have good overlap between the F and R reads, it sounds like a parameter input issue. Depending on your marker length, the preset CO1 and 16S values we provide might not match the marker you are targeting in this gene region (CO1 and 16S are both > 1,000 bp long, and there are a lot of potential metabarcoding targets throughout these gene regions).
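
As a concrete sanity check (a sketch only; the fastq name is hypothetical and the Leray fragment is just an example), compare your raw read lengths and expected amplicon length against the value in metabarcode_loci_min_merge_length.txt. The Leray CO1 fragment, for instance, is roughly 313 bp between the primers, so a merged, primer-trimmed read should come out around that length, and a minimum merge length set much above it would discard every merged pair.

# distribution of raw read lengths (file name is illustrative)
zcat COI_data/sample1_R1.fastq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c | tail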

Just to confirm, you were able to reproduce the 12S test example? I just want to make sure that it's not a dependency issue or another issue related to the implementation of the scripts.

Taxonomic assignment issue: Not sure about the CO1 error. @limey-bean, any thoughts?

Wrong metabarcode: I think this is the real CRUX (pun intended) of the matter. It is true that CRUX grabs the longest matching sequence per accession, but that does not mean it includes all reference sequences for your portion of CO1. Presumably you should be getting some alignment, even with poor coverage in your reference DB, but if I recall correctly the final references are trimmed to some degree, so this may not even be true (again, @limey-bean can answer this in far more detail and accuracy). You might just want to do a quick and dirty alignment of 5 of your ASVs against the first 100 reference sequences (using Clustal or another web-based tool) and confirm that they overlap at all. My guess is that the pre-generated CRUX reference sequences do not overlap with your metabarcoding marker, which would also explain why the 16S ASVs could not be aligned. But it would be good to confirm this. If that is the case, you would need to generate reference databases matching your loci of interest using CRUX.
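
As a minimal sketch of that spot check (file names are illustrative, and it assumes BLAST+ is installed; Clustal through a web form works just as well):

# grab the first 5 ASVs and the first 100 reference sequences
awk '/^>/ {n++} n <= 5'   nochim_mergedCOI.fasta      > asv_subset.fasta
awk '/^>/ {n++} n <= 100' path/to/CO1_reference.fasta > ref_subset.fasta
# any alignment at all? empty output suggests the pre-built reference DB
# does not cover your primer region
blastn -query asv_subset.fasta -subject ref_subset.fasta -outfmt 6 -evalue 1e-5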

FabianRoger commented 3 years ago

Thanks for this. Yes, for 16S, if a different hyper-variable loop is targeted there might well be no overlap. I will try to blast my reads against the provided fasta file and see what I get.

For COI, all primers should be in the Folmer region, which is only about 600 bp, and most COI sequences should cover that region, so here I would be surprised if I didn't get any overlap. But I'll check. If not, I will try to generate my own reference database.

Update: I tried to BLAST my sequences against the Anacapa 16S and COI databases, respectively, and it indeed looks like there is no overlap...

By the way, I noticed that for CRUX you kept the longest sequence but didn't check whether the longest also had the most complete taxonomic annotation? Or did I misunderstand that part?

When I built a reference database previously, I tried to make use of the taxonomic annotations from multiple matches (after 100% clustering) to get a consensus taxonomy. However, it's not trivial, as synonyms need to be resolved (I used the GBIF Backbone) and some sort of majority rule needs to be implemented so that mis-annotations don't prevent a consensus...
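
For what it's worth, the majority-rule part on its own is simple; here is a rough sketch (my own illustration, not anything from Anacapa or CRUX) that takes one semicolon-separated taxonomy path per line for the matches of a single cluster, after synonyms have already been resolved, and keeps each rank only while a strict majority of hits agrees:

awk -F';' '
{
  n++
  if (NF > maxr) maxr = NF
  for (i = 1; i <= NF; i++) count[i SUBSEP $i]++
}
END {
  for (i = 1; i <= maxr; i++) {
    best = ""; bestc = 0
    for (k in count) {
      split(k, a, SUBSEP)
      if (a[1] + 0 == i && count[k] > bestc) { bestc = count[k]; best = a[2] }
    }
    # keep this rank only if more than half of the hits agree on it
    if (bestc * 2 > n) path = (path == "" ? best : path ";" best)
    else break
  }
  print path
}' cluster_hits_taxonomy.txt

The tricky parts are exactly the ones upstream of this: resolving synonyms (the GBIF Backbone lookup) and deciding how much disagreement to tolerate before truncating the path.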