Closed camilababo closed 1 month ago
Hello @camilababo,
Thank you very much for bringing this issue to our attention. You are correct that the value for --coverage
should be between 0 and 100, not between 0 and 1. My apologies for this oversight. Luckily, the current CRABS code (crabs --version 1.0.3
) does not have to change, but I have specified the correct input value format in the crabs -h
help documentation, as well as corrected the README document in crabs --version 1.0.4
. I hope this helps.
Good luck with the development of your pipeline, it sounds like an interesting program.
Best wishes, Gert-Jan
Thank you for your response, and congratulations on developing such a helpful tool!
All the best, Camila
Hello there,
I am currently working to integrate CRABS into a pipeline for evaluating primers for metabarcoding initiatives. We are particularly interested in the Pairwise Global Alignment feature to extract amplicons without primer-binding regions. However, after a few tests, we've detected an issue with extracting sequence regions from VSEARCH's
usearch_global
command.For context, I'm currently using CRABS v0.1.7. In this version, the issue seems to be in the direct use of
float(COV)
in Step 4 of thepga()
method, whereas, in the new version (v1.0.0), this is present in the standalone method inextract_alignment_results()
, line 1250 and 1255.In the current implementation, the methods use the
coverage
argument to filter sequence regions through the target coverage. However, while percentage identity is expected to be provided as a decimal percentage when it comes to theidcutoff
parameter forusearch_global
, the values of the columns outputted by VSEARCH are presented in whole percentages. A quick look can be taken at some results we've obtained here: test-vsearch.txt.CRABS expects both to be provided as decimal percentages: "Both
--percent-identity
and--coverage
should be provided as a percentage value between 0 and 1 (e.g., 95% = 0.95)". This causes the method to extract almost all sequence regions given that, at most, it corresponds to a hit of 1%, resulting in the extraction of much shorter than the provided reference amplicon sequences, such as nonsense sequences as short as 22 nucleotides.I've re-run this method by multiplying the coverage argument by 100:
And the results look more accurate: *Folder A is the folder containing the output from the in-silico analysis and Folder B is the folder containing the output from CRABS PGA method.
I'm curious about your perspective on this—please let me know if you agree with the results we've presented here!
Thank you in advance, Camila