vpc-ccg / sedef

Identification of segmental duplications in the genome
MIT License

How to filter the final.bed result? #20

Closed by Yujiaxin419 4 years ago

Yujiaxin419 commented 4 years ago

Dear professor: Thank you for developing this efficient and easy-to-use program. I have a question about my data. I am working with a plant genome of ~350 Mb. First, I soft-masked the genome with RepeatMasker and bedtools. Running the sedef pipeline then produced a result file named 'final.bed'. I retained only the SDs with length > 1000 and fracMatch > 0.9. But after filtering, I still get ~330 Mb of SDs, which I think is too high, and I don't know how to filter the results further. Could you please give me some advice on this? Thanks a lot. Yujiaxin

inumanag commented 4 years ago

Hi @Yujiaxin419

Here are some hints:

In general, SD filtering is not a trivial process, because an SD that has 75% similarity might contain sub-SDs that have 90%+ similarity. You will often need to parse the CIGAR strings manually and apply a criterion of your choice to filter or extract those SDs.
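To make the CIGAR-parsing idea concrete, here is a minimal Python sketch of one possible criterion. It is not part of sedef. It assumes the CIGAR string distinguishes matches (`=`) from mismatches (`X`); if your strings use only `M`, identity cannot be recovered from the CIGAR alone. It also assumes that `D` consumes bases only on the first copy and `I` only on the second. The function names and the thresholds (`max_gap`, `min_len`, `min_identity`) are hypothetical and purely illustrative. The sketch cuts an alignment at every large gap and keeps the pieces that are long and similar enough on their own.

```python
import re
from typing import List, Tuple


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '500=1X20D300=' into (length, op) pairs."""
    # Assumption: only the operations =, X, I and D appear.
    if not re.fullmatch(r"(\d+[=XID])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([=XID])", cigar)]


def high_identity_segments(cigar, max_gap=50, min_len=1000, min_identity=0.9):
    """Return (start1, end1, start2, end2, identity) for each kept sub-alignment.

    Coordinates are 0-based offsets from the start of the alignment on each
    copy. Gaps of at least `max_gap` bases end the current segment; shorter
    gaps stay inside it and count against its identity.
    """
    kept = []
    pos1 = pos2 = 0
    seg = None  # [start1, end1, start2, end2, matches, columns]

    def flush():
        # Keep the current segment only if it passes both thresholds.
        if seg and seg[5] >= min_len and seg[4] / seg[5] >= min_identity:
            kept.append((seg[0], seg[1], seg[2], seg[3], seg[4] / seg[5]))

    for length, op in parse_cigar(cigar):
        step1 = length if op in "=XD" else 0  # bases consumed on copy 1
        step2 = length if op in "=XI" else 0  # bases consumed on copy 2
        if op in "ID" and length >= max_gap:
            flush()          # a large gap closes the current segment
            seg = None
        elif op in "=X":
            if seg is None:  # open a new segment at the first aligned base
                seg = [pos1, pos1, pos2, pos2, 0, 0]
            seg[1], seg[3] = pos1 + step1, pos2 + step2
            seg[4] += length if op == "=" else 0
            seg[5] += length
        elif seg is not None:
            seg[5] += length  # small gap inside a segment
        pos1 += step1
        pos2 += step2
    flush()
    return kept


if __name__ == "__main__":
    for segment in high_identity_segments("1200=100D950=50X"):
        print(segment)
```

For a forward-strand pair, the returned offsets can be added to the start coordinates of the two copies to get genomic positions; a reverse-strand copy would need its offsets counted from the other end. Which column of 'final.bed' holds the CIGAR string should be checked against the sedef documentation rather than taken from this sketch.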

Ibrahim

Yujiaxin419 commented 4 years ago

Dear professor:

Thank you for your useful suggestions. I will consider your advice carefully.

Yujiaxin