Closed ddepierre closed 1 year ago
--max-mismatch 500 is a very high value for this argument. Do you have a specific biological reason to use 500 there? A more typical value would be no more than 3. I suspect this is what is causing the problem.
If you have to use it, you can try using --backend cython.
If you could share the file with the pairs that causes this error, it would help us a lot.
OK, it worked fine with --max-mismatch 3 and --backend cython.
Can --max-mismatch 500 be used to filter out pairs with a distance <500 bp (even though that is of course not the purpose of the dedup function)? Or do I misunderstand this option?
Generally, dedup considers pairs with exactly matching coordinates to be duplicates of each other. However, there might occasionally be pairs shifted by a few base pairs that are also duplicates. This option sets the threshold for this inexact matching (e.g. pairs with coordinates (3, 5), (3, 5) are always duplicates, but (3, 5), (3, 6) are duplicates only with --max-mismatch 1 or higher).
So in principle you could use a large number if you are using some unusual protocol that would generate such duplicates. But what can happen is that the algorithm tries to annotate all pairs as duplicates of each other, and strange things follow (e.g. it takes forever, it runs out of memory, etc.). The Cython backend wouldn't have a problem like that, but you'd still end up with no pairs in the end.
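To make the inexact matching described above concrete, here is a minimal Python sketch. It is not the pairtools implementation: the function name mark_duplicates is made up for illustration, it assumes all pairs sit on the same chromosomes and strands, and it assumes the mismatch is checked separately on each side of the pair. It only shows that --max-mismatch compares two pairs with each other and says nothing about the distance between the two sides of a single pair.

```python
def mark_duplicates(pairs, max_mismatch):
    """Flag each (pos1, pos2) pair that lies within max_mismatch bp,
    on both sides, of an earlier pair that was kept.

    Illustrative only: assumes every pair is on the same chromosomes
    and strands, and uses a simple quadratic scan.
    """
    kept = []   # pairs kept so far (not flagged as duplicates)
    flags = []  # True where the pair is flagged as a duplicate
    for pos1, pos2 in pairs:
        is_dup = any(
            abs(pos1 - k1) <= max_mismatch and abs(pos2 - k2) <= max_mismatch
            for k1, k2 in kept
        )
        flags.append(is_dup)
        if not is_dup:
            kept.append((pos1, pos2))
    return flags


# The example from the explanation above.
print(mark_duplicates([(3, 5), (3, 5), (3, 6)], max_mismatch=0))  # [False, True, False]
print(mark_duplicates([(3, 5), (3, 5), (3, 6)], max_mismatch=1))  # [False, True, True]

# Three distinct long-range pairs (each spans about 8 kb).
distinct = [(1000, 9000), (1200, 9400), (1450, 9100)]
print(mark_duplicates(distinct, max_mismatch=3))    # [False, False, False]
print(mark_duplicates(distinct, max_mismatch=500))  # [False, True, True]
```

In the last call, two of the three pairs are dropped even though none of them is shorter than 500 bp, so a large threshold removes nearby but distinct pairs rather than filtering by pair distance.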
Glad this solved your problem!
With --max-mismatch 3 I would recommend switching to --backend cython. Cython and the default engine produce the same results, but Cython is more appropriate for massive numbers of duplicates (which usually happens in low-complexity libraries or single-cell analysis).
Hi,
I am trying to run the dedup command, but it runs for a very long time, I get error messages, and then the command is stopped/killed.
Command:
pairtools dedup --output-stats 04_PAIRTOOLS/stats/sample.04_dedup.stats --max-mismatch 500 04_PAIRTOOLS/sam/sample.03_selected.temp.sam --output 04_PAIRTOOLS/sam/sample.04_dedup.temp.sam
with sample.03_selected.temp.sam being a sam/pairs output from pairtools select.
Error:
I tried the same command line on other samples with about the same number of pairs. On some it works well in less than 10 min, while on others it takes more than 10 hours before I get the value type error.
Do you think it could come from the reads themselves? Here are some pairs.sam on which it runs fast:
Here are some pairs.sam on which it runs super slow and then outputs an error:
Also, when I take only a subsample of my sample.03_selected.temp.sam pairs, it works fine. So I don't really understand where the problem comes from and why it takes so long to process the second type of pairs/reads.
Best, David