Filtering output of table command

heliziii commented 4 months ago

Hello,

Thank you for your tool! We have been using Parascopy to find SEGDUP coordinates for the rat reference genome, and I had a couple of clarifying questions.

The table generated by Parascopy has ~1.1M regions, which seems like a lot. Could you provide some insight into whether any filtering was applied to your precomputed tables for the human genome?

Additionally, I couldn't find documentation on how to interpret certain fields in the table file, like SS or ALENGTH. Could you please provide some guidance on this?

Best regards, Helia

tprodanov commented 4 months ago

Hi Helia!

Did you use both parascopy pretable and parascopy table?

For the human genome I have 600k regions, so 1.1M does not sound too bad. One important factor is that the duplications in the output table are pairwise, meaning that a three-copy duplication will have 3 x 2 = 6 entries in the table: at each copy there will be two entries referring to the two other copies.
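To make the pairwise counting concrete, here is a minimal Python sketch. It is not part of Parascopy; the function name and the copy labels are made up for illustration. It only shows why n copies give n x (n - 1) entries.

```python
from itertools import permutations


def pairwise_entries(copies):
    """Return ordered (copy, other_copy) pairs.

    Illustrative only: each copy gets one entry for every other copy,
    so n copies produce n * (n - 1) entries in total.
    """
    return list(permutations(copies, 2))


# Hypothetical three-copy duplication with copies labelled A, B, C.
entries = pairwise_entries(["A", "B", "C"])
for copy, other in entries:
    print(f"entry at copy {copy} referring to copy {other}")
print(len(entries))
```

The entry count grows quadratically with the number of copies, so a modest number of high-copy repeats can account for a large share of the table.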

As for filtering, you can use parascopy view and filter based on length (ALENGTH), sequence similarity (SS), and other fields. Other useful fields are: SEP - distance between the two copies; NM - edit distance; DIFF - number of mismatches, deletions, and insertions; compl - sequence complexity, where smaller values indicate homopolymers and other short repeats; av_mult - a similar measure: how many times each unique 11-mer appears in the sequence.
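If you would rather post-filter outside of parascopy view, the idea can be sketched in Python as below. This is only a sketch under assumptions: it assumes the fields are stored as a semicolon-separated KEY=VALUE string, the example strings are invented rather than real table rows, and the thresholds (1000 and 0.98) are arbitrary placeholders, not recommended values.

```python
def parse_info(info):
    """Parse a 'KEY=VALUE;KEY=VALUE' string into a dict of strings.

    Assumption: the table stores its fields in this form.
    """
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def keep_entry(info, min_alength=1000, min_ss=0.98):
    """Keep an entry if ALENGTH and SS reach the (placeholder) thresholds."""
    fields = parse_info(info)
    try:
        alength = int(fields["ALENGTH"])
        ss = float(fields["SS"])
    except (KeyError, ValueError):
        # Entries with missing or malformed fields are dropped in this sketch.
        return False
    return alength >= min_alength and ss >= min_ss


# Invented example strings, not real Parascopy output.
examples = [
    "SS=0.995;ALENGTH=15000;NM=75",
    "SS=0.910;ALENGTH=20000;NM=1800",
    "SS=0.990;ALENGTH=300;NM=3",
]
kept = [info for info in examples if keep_entry(info)]
print(kept)
```

The same pattern extends to the other fields (SEP, compl, av_mult) by adding more conditions to keep_entry.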

Finally, you can use parascopy examine to combine multi-copy repeats together, which makes it possible to identify them. But note that this output will be riddled with shorter repeats.

Also, sometimes there are tangled regions, where the duplication structure is too complex to process.

heliziii commented 1 month ago

Thank you!
