gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

PGA output file name override #48

Open jordanpcuff opened 8 months ago

jordanpcuff commented 8 months ago

Dear CRABS team,

Firstly, thank you for creating such a user-friendly resource. I have found CRABS almost worryingly easy to use and follow, which is great. I am having a few issues though, particularly with the PGA step. The output appears to be saved as 'CRABS_pga.fasta' regardless of the name I pass to --output. I ignored this at first, but the file doesn't seem to be overwritten when I rerun this step, and I'm having some other downstream issues (described below) with no discernible source, so I'm wondering whether this is indicative of a wider problem. Here's my input:

docker run --rm -it \
  -v "$(pwd)":/data \
  --workdir="/data" \
  quay.io/swordfish/crabs:0.1.4 \
  crabs pga \
    --input arthropoda_coi.fasta \
    --output Arthropoda_coi_merged_LCO230_pga.fasta \
    --database Arthropoda_coi_merged_LCO230.fasta \
    --fwd GGTCAACAAATCATAAAGAYATYGG \
    --rev CTTATRTTRTTTATNCGNGGRAANGC \
    --speed medium --percid 0.95 --coverage 0.95 \
    --filter_method strict
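For anyone trying to reproduce this, a quick check of the working directory after the run can show which file the PGA step actually wrote and whether it is stale. This is only a rough sketch: `report_fasta` is a hypothetical helper (not part of CRABS), and it simply assumes both candidate file names would land in the mounted directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CRABS): report whether a FASTA file
# exists and how many sequences (header lines starting with '>') it holds.
report_fasta() {
  local f="$1"
  if [ -f "$f" ]; then
    printf '%s: %s sequences\n' "$f" "$(grep -c '^>' "$f")"
  else
    printf '%s: missing\n' "$f"
  fi
}

# The name requested via --output versus the name that seems to be written.
report_fasta Arthropoda_coi_merged_LCO230_pga.fasta
report_fasta CRABS_pga.fasta

# Modification times show whether a rerun replaced the file or left it untouched.
ls -l Arthropoda_coi_merged_LCO230_pga.fasta CRABS_pga.fasta 2>/dev/null || true
```

If only CRABS_pga.fasta exists and its modification time does not change between runs, that would match the behaviour described above.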

I am using the CRABS database with DADA2 and getting a lot of very poorly resolved assignments or misassignments (ASVs from my terrestrial insect samples identified as exotic decapods). When I BLAST these same sequences, I get reasonable terrestrial insect species. I understand the issues with BLAST, but the results appear to be relatively unambiguous, and for insect COI metabarcoding I would expect fairly decent resolution and coverage. I can only assume that the above issue is causing this somehow, but any other suggestions on how to resolve it would be greatly appreciated.

This may simply be my naivety though, and perhaps a user error or misunderstanding!

Many thanks,

Jordan

gjeunen commented 8 months ago

Hello @jordanpcuff,

Thank you for using CRABS!

I wonder if your problem is associated with the docker installation, since I cannot recreate the issue with the GitHub and conda installations. @hughcross, could you please see if you can recreate the above-mentioned issue in docker?

Just FYI, if the PGA step is not working as intended, I'm not surprised you are currently seeing some "weird" taxonomy assignments for your insect data. With COI, a lot of reference barcodes will have been generated using one or both of your primers. The primer regions will have been cut off those sequences, so those barcodes are currently not present in your ref DB. Once we get the PGA step working, you could try restricting your reference sequences to non-marine organisms to see if it improves the taxonomy assignment.
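To get a rough feel for how many downloaded sequences still carry the forward primer, something like the following could be run on the input FASTA. It is a sketch only: `iupac_to_regex` and `count_with_primer` are hypothetical helpers (not CRABS commands), matching is exact with no mismatches allowed, and only the forward strand is searched, so the reverse primer would first need to be reverse-complemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand IUPAC ambiguity codes into a regex.
# The replacements only introduce A, C, G, T and brackets, so applying
# the substitutions one after another is safe.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Hypothetical helper: count sequences in a FASTA file that contain the primer.
# awk joins wrapped sequence lines so a primer spanning a line break is found.
count_with_primer() {
  local regex
  regex="$(iupac_to_regex "$1")"
  awk '/^>/ { if (seq != "") print seq; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }' "$2" | grep -c -i -E "$regex"
}

# Forward primer and input file taken from the command in this issue.
if [ -f arthropoda_coi.fasta ]; then
  count_with_primer GGTCAACAAATCATAAAGAYATYGG arthropoda_coi.fasta
fi
```

Comparing this count with the total number of sequences in the file gives an indication of how much of the database depends on the PGA step.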

Best, Gert-Jan