Issue with the use of the coverage argument as a decimal percentage in PGA sequence extraction

camilababo commented 1 month ago

Hello there,

I am currently working to integrate CRABS into a pipeline for evaluating primers for metabarcoding initiatives. We are particularly interested in the Pairwise Global Alignment feature to extract amplicons without primer-binding regions. However, after a few tests, we've detected an issue with extracting sequence regions from VSEARCH's usearch_global command.

For context, I'm currently using CRABS v0.1.7. In this version, the issue seems to be in the direct use of float(COV) in Step 4 of the pga() method, whereas, in the new version (v1.0.0), this is present in the standalone method in extract_alignment_results(), line 1250 and 1255.

In the current implementation, the methods use the coverage argument to filter sequence regions through the target coverage. However, while percentage identity is expected to be provided as a decimal percentage when it comes to the idcutoff parameter for usearch_global, the values of the columns outputted by VSEARCH are presented in whole percentages. A quick look can be taken at some results we've obtained here: test-vsearch.txt.

CRABS expects both to be provided as decimal percentages: "Both --percent-identity and --coverage should be provided as a percentage value between 0 and 1 (e.g., 95% = 0.95)". This causes the method to extract almost all sequence regions given that, at most, it corresponds to a hit of 1%, resulting in the extraction of much shorter than the provided reference amplicon sequences, such as nonsense sequences as short as 22 nucleotides.

I've re-run this method by multiplying the coverage argument by 100: Screenshot from 2024-10-16 12-11-06

And the results look more accurate: *Folder A is the folder containing the output from the in-silico analysis and Folder B is the folder containing the output from CRABS PGA method. Screenshot from 2024-10-16 12-17-16

I'm curious about your perspective on this—please let me know if you agree with the results we've presented here!

Thank you in advance, Camila

gjeunen commented 1 month ago

Hello @camilababo,

Thank you very much for bringing this issue to our attention. You are correct that the value for --coverage should be between 0 and 100, not between 0 and 1. My apologies for this oversight. Luckily, the current CRABS code (crabs --version 1.0.3) does not have to change, but I have specified the correct input value format in the crabs -h help documentation, as well as corrected the README document in crabs --version 1.0.4. I hope this helps.

Good luck with the development of your pipeline, it sounds like an interesting program.

Best wishes, Gert-Jan

camilababo commented 1 month ago

Thank you for your response, and congratulations on developing such a helpful tool!

All the best, Camila

gjeunen / reference_database_creator

Issue with the use of the coverage argument as a decimal percentage in PGA sequence extraction #73