gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

PGA difference between conda and docker versions? but better with conda?? #45

Open eddydowle opened 7 months ago

eddydowle commented 7 months ago

Just noticed that the conda version of crabs seems to do a better job of the PGA step than the docker version (though I thought the conda version was older).

For example, in conda:

    module load conda
    conda activate crabs
    crabs db_download --source ncbi --database nucleotide --query 'cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title] OR ("complete"[Title] AND mitochondrion[filter] AND ("8000"[SLEN] : "30000"[SLEN])) AND Arthropoda[Organism] AND mitochondrion[filter] NOT environmental sample[Title] NOT environmental samples[Title]' --output coi_ncbi.fasta --keep_original yes --email eddy.dowle@plantandfood.co.nz --batchsize 2000
    crabs db_download --source bold --database 'Arthropoda' --output bold_arthropoda.fasta --keep_original yes
    crabs db_merge --output output_COI_ncbi_bold_arthropoda.fasta --uniq yes --input coi_ncbi.fasta bold_arthropoda.fasta
    crabs insilico_pcr --input output_COI_ncbi_bold_arthropoda.fasta --output output_COI_ncbi_bold_arthropoda_pcr.fasta --fwd ACWGGWTGRACWGTNTAYCC --rev TCDGGRTGNCCRAARAAYCA --error 4.5
    crabs pga --input output_COI_ncbi_bold_arthropoda.fasta --output output_COI_ncbi_bold_arthropoda_pcr_pga.fasta --database output_COI_ncbi_bold_arthropoda_pcr.fasta --fwd ACWGGWTGRACWGTNTAYCC --rev TCDGGRTGNCCRAARAAYCA --speed medium --percid 0.90 --coverage 0.90 --filter_method strict
    crabs assign_tax --input output_COI_ncbi_bold_arthropoda_pcr_pga.fasta --output output_COI_ncbi_bold_arthropoda_pcr_pga.tsv --acc2tax nucl_gb.accession2taxid --taxid nodes.dmp --name names.dmp
    crabs dereplicate --input output_COI_ncbi_bold_arthropoda_pcr_pga.tsv --output output_COI_ncbi_bold_arthropoda_pcr_pga_derep.tsv --method uniq_species
    crabs seq_cleanup --input output_COI_ncbi_bold_arthropoda_pcr_pga_derep.tsv --output output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv --minlen 100 --maxlen 500 --maxns 4 --enviro yes --nans 0

I get a file with 608629 sequences:

    wc -l output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv
    608629 output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv

But with the docker version, using the same commands:

    module load singularity
    singularity exec crabs_0.1.4.sif crabs db_download --source ncbi --database nucleotide --query 'cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title] OR ("complete"[Title] AND mitochondrion[filter] AND ("8000"[SLEN] : "30000"[SLEN])) AND Arthropoda[Organism] AND mitochondrion[filter] NOT environmental sample[Title] NOT environmental samples[Title]' --output coi_ncbi.fasta --keep_original yes --email eddy.dowle@plantandfood.co.nz --batchsize 2000
    singularity exec crabs_0.1.4.sif crabs db_download --source bold --database 'Arthropoda' --output bold_arthropoda.fasta --keep_original yes
    singularity exec crabs_0.1.4.sif crabs db_merge --output output_COI_ncbi_bold_arthropoda.fasta --uniq yes --input coi_ncbi.fasta bold_arthropoda.fasta
    singularity exec crabs_0.1.4.sif crabs insilico_pcr --input output_COI_ncbi_bold_arthropoda.fasta --output output_COI_ncbi_bold_arthropoda_pcr.fasta --fwd ACWGGWTGRACWGTNTAYCC --rev TCDGGRTGNCCRAARAAYCA --error 4.5
    singularity exec crabs_0.1.4.sif crabs pga --input output_COI_ncbi_bold_arthropoda.fasta --output output_COI_ncbi_bold_arthropoda_pcr_pga.fasta --database output_COI_ncbi_bold_arthropoda_pcr.fasta --fwd ACWGGWTGRACWGTNTAYCC --rev TCDGGRTGNCCRAARAAYCA --speed medium --percid 0.90 --coverage 0.90 --filter_method strict
    singularity exec crabs_0.1.4.sif crabs db_download --source taxonomy
    singularity exec crabs_0.1.4.sif crabs assign_tax --input output_COI_ncbi_bold_arthropoda_pcr_pga.fasta --output output_COI_ncbi_bold_arthropoda_pcr_pga.tsv --acc2tax nucl_gb.accession2taxid --taxid nodes.dmp --name names.dmp --missing missing_taxa.tsv
    singularity exec crabs_0.1.4.sif crabs dereplicate --input output_COI_ncbi_bold_arthropoda_pcr_pga.tsv --output output_COI_ncbi_bold_arthropoda_pcr_pga_derep.tsv --method uniq_species
    singularity exec crabs_0.1.4.sif crabs seq_cleanup --input output_COI_ncbi_bold_arthropoda_pcr_pga_derep.tsv --output output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv --minlen 100 --maxlen 500 --maxns 4 --enviro yes --nans 0

I get:

    wc -l output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv
    176696 output_COI_ncbi_bold_arthropoda_pcr_pga_derep_clean.tsv

That is 176696 sequences, in contrast to the 608629 sequences returned by the conda version. Looking through the intermediate files, it's definitely the PGA step that varies between the versions, and the conda version is keeping stuff that is useful (e.g. stuff that is on target but missing the primer regions at one or both ends).
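
For anyone wanting to reproduce the per-step comparison, here is a rough sketch of how the intermediate files from the two runs could be counted side by side. It assumes the conda and docker outputs were written to two separate directories (`conda_run` and `docker_run` are hypothetical names, as are the helper functions `count_records` and `compare_runs`), and that the file names match the commands above. FASTA files are counted by header lines and everything else by line count.

```shell
#!/usr/bin/env bash

# Count records in one intermediate file.
# FASTA: number of '>' header lines; anything else (e.g. TSV): number of lines.
count_records() {
    local file="$1"
    case "$file" in
        *.fasta) grep -c '^>' "$file" ;;
        *)       wc -l < "$file" | tr -d ' ' ;;
    esac
}

# Print "file <TAB> count_in_dir_a <TAB> count_in_dir_b" for every
# output_COI_* file that exists in both run directories (assumed layout).
compare_runs() {
    local dir_a="$1" dir_b="$2" path name
    for path in "$dir_a"/output_COI_*; do
        name=$(basename "$path")
        [ -f "$dir_b/$name" ] || continue
        printf '%s\t%s\t%s\n' "$name" \
            "$(count_records "$path")" \
            "$(count_records "$dir_b/$name")"
    done
}

# Example call (directory names are placeholders):
# compare_runs conda_run docker_run
```

The first file whose counts diverge between the two columns points at the step where the versions start to differ.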

eddydowle commented 7 months ago

That's an awful code block, but I can send you the scripts.