Closed: heliziii closed this issue 1 month ago
Hi Helia!
Did you use both `parascopy pretable` and `parascopy table`?
For the human genome I have 600k regions, so 1.1M does not sound too bad. One important factor is that the duplications in the output table are pairwise: a three-copy duplication will have 3 × 2 = 6 entries in the table, because each copy has two entries referring to the two other copies.
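The pairwise bookkeeping above can be sketched in a few lines (illustrative only, not part of Parascopy):

```python
def pairwise_entries(n_copies: int) -> int:
    """Number of table entries produced by a duplication with n_copies copies.

    Entries are ordered pairs: each copy gets one entry per other copy.
    """
    return n_copies * (n_copies - 1)

print(pairwise_entries(3))  # 6: each of the 3 copies refers to the 2 others
```

So a repeat family with many copies inflates the row count quadratically, which is part of why 1.1M rows is plausible.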
As for filtering, you can use `parascopy view` and filter based on length (`ALENGTH`), sequence similarity (`SS`) and other fields. Other useful fields are:
- `SEP`: distance between the two copies;
- `NM`: edit distance;
- `DIFF`: number of mismatches, deletions and insertions;
- `compl`: sequence complexity; smaller values represent homopolymers and other short repeats;
- `av_mult`: similar to `compl`; how many times each unique 11-mer appears in the sequence.
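A post-hoc filter on these fields can also be sketched in plain Python. Note the column layout and the `KEY=VALUE;...` info string below are assumptions made for illustration; check the actual `parascopy view` output format before adapting this:

```python
import csv
import io

# Hypothetical two-row extract of a duplication table (layout assumed, not
# Parascopy's real format): chrom, start, end, partner copy, info string.
table = io.StringIO(
    "chr1\t100\t900\tchr5:2000-2800\tALENGTH=800;SS=97.5;SEP=1000000\n"
    "chr2\t50\t150\tchr2:300-400\tALENGTH=100;SS=91.0;SEP=250\n"
)

def parse_info(field: str) -> dict:
    # Split a "KEY=VALUE;..." info string into a dict of floats.
    return {k: float(v) for k, v in (p.split("=") for p in field.split(";"))}

kept = []
for row in csv.reader(table, delimiter="\t"):
    info = parse_info(row[4])
    # Keep long, highly similar duplications (thresholds are illustrative).
    if info["ALENGTH"] >= 500 and info["SS"] >= 95:
        kept.append(row)

print(len(kept))  # 1: only the chr1 duplication passes both thresholds
```

The same effect can of course be achieved with `awk` on the tab-separated output; the point is only that `ALENGTH` and `SS` thresholds remove most short, divergent entries.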
Finally, you can use `parascopy examine` to combine pairwise entries into multi-copy repeats. But note that this output will be riddled with shorter repeats.
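Conceptually, combining pairwise entries into multi-copy repeats amounts to finding connected components over the copies. A minimal union-find sketch of that idea (my reading of what such a merge does, not Parascopy's implementation):

```python
from collections import defaultdict

# Pairwise duplication records (copy_a, copy_b), as in the output table.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]

parent = {}

def find(x):
    # Union-find root lookup with path compression.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in pairs:
    union(a, b)

# Group copies by their root: each group is one multi-copy repeat.
groups = defaultdict(set)
for node in parent:
    groups[find(node)].add(node)

print(sorted(len(g) for g in groups.values()))  # [2, 3]
```

Here the three pairwise entries among A, B and C collapse into a single three-copy repeat, while D and E stay a two-copy duplication.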
Also, sometimes there are *tangled* regions, where the duplication structure is too complex to process.
thank you!
Hello,
Thank you for your tool! We have been using Parascopy to find SEGDUP (segmental duplication) coordinates for the rat reference genome, and I had a couple of clarifying questions.
The table generated by Parascopy has ~1.1M regions, which seems like a lot. Could you provide some insight into whether any filtering was applied to your precomputed tables for the human genome?
Additionally, I couldn't find documentation on how to interpret certain fields in the table file, such as SS or ALENGTH. Could you please provide some guidance on this?
Best regards, Helia