How to filter the final.bed result?

Yujiaxin419 commented 4 years ago

Dear professor: Thank you for developing this efficient and easy-to-use program. I have a question about my data. I am working with a plant genome of about 350 Mb. First, I soft-masked the genome with RepeatMasker and bedtools. Running the sedef pipeline then produced a result file named 'final.bed'. I retained only the entries with length > 1000 and fracMatch > 0.9. But after filtering, I still get ~330 Mb of SD results, which I think is too high, and I don't know how to filter the results further. Could you please give me some advice on this? Thanks a lot. Yujiaxin

inumanag commented 4 years ago

Hi @Yujiaxin419

Here are some hints:

Try using fracMatchIndel. fracMatch ignores gaps, and if you have large gaps, it will consider large gaps as "SD-aligned" bases. You can use CIGARs as well for better filtering.
Another cause might be tandem SDs, where one SD might be counted many times (e.g. if you have region A copied to B, C, D, and E, SEDEF will report A-B, A-C, A-D, A-E, B-C, B-D, B-E, C-D, C-E and D-E).
Make sure that your genome is masked; otherwise, many of the reported SDs might just be common repeats.
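
To make the first two hints concrete, here is a minimal Python sketch that filters on fracMatchIndel instead of fracMatch and then counts every SD base once, so that A-B, A-C, B-C and similar pairs do not inflate the total. It assumes that final.bed is tab-separated and that its first line is a header naming the columns, and that the coordinate columns are called chr1, start1, end1, chr2, start2 and end2; those names are assumptions to check against your own file. The function names and thresholds are illustrative only and are not part of SEDEF.

```python
def read_sd_rows(lines):
    """Yield one dict per SD pair, keyed by the names found in the header line.

    Assumption: the first non-empty line is a tab-separated header,
    optionally prefixed with '#'.
    """
    header = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        if header is None:
            header = [name.lstrip("#").strip() for name in fields]
            continue
        yield dict(zip(header, fields))


def keep(row, min_len=1000, min_frac=0.9):
    """Keep pairs whose shorter copy exceeds min_len and whose
    gap-aware identity (fracMatchIndel) exceeds min_frac."""
    len1 = int(row["end1"]) - int(row["start1"])
    len2 = int(row["end2"]) - int(row["start2"])
    return min(len1, len2) > min_len and float(row["fracMatchIndel"]) > min_frac


def merged_coverage(rows):
    """Return the number of bases covered by at least one SD copy.

    Both copies of every pair are collected per chromosome, overlapping
    intervals are merged, and each base is counted once.
    """
    by_chrom = {}
    for row in rows:
        for chrom, start, end in (("chr1", "start1", "end1"),
                                  ("chr2", "start2", "end2")):
            by_chrom.setdefault(row[chrom], []).append(
                (int(row[start]), int(row[end])))
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start > cur_end:
                # Disjoint interval: close the current run and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        total += cur_end - cur_start
    return total
```

With a file handle `f` opened on final.bed, `merged_coverage(r for r in read_sd_rows(f) if keep(r))` gives the non-redundant SD content in bases, which is the number to compare against the genome size rather than the sum of all pair lengths.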

In general, SD filtering is not a trivial process because an SD that has 75% similarity might contain sub-SDs that have 90%+ similarity. You will often need to manually parse CIGAR strings and use criteria of your choice to filter/extract those SDs.
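
As a rough illustration of what parsing the CIGAR strings can look like, the Python sketch below splits one alignment into blocks wherever a gap is longer than a chosen threshold, so that each block can then be judged as a candidate sub-SD on its own. It assumes SAM-style CIGAR operations (M, =, X, I, D) and that D consumes bases of the first sequence while I consumes bases of the second; both are assumptions to verify against the CIGARs in your final.bed. The function name and the `max_gap` threshold are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDX=])")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '500M300D200M' into [(500, 'M'), ...]."""
    if not re.fullmatch(r"(\d+[MIDX=])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def split_at_large_gaps(cigar, max_gap=100):
    """Split an alignment into blocks separated by gaps longer than max_gap.

    Each block records its offset in both sequences (relative to the start
    of the alignment), the number of aligned columns, and the number of
    bases in the small gaps that stayed inside the block.
    """
    blocks = []
    pos_a = pos_b = 0
    current = None
    for n, op in parse_cigar(cigar):
        is_gap = op in "ID"
        if is_gap and n > max_gap:
            # A large gap ends the current block; it belongs to no block.
            if current is not None:
                blocks.append(current)
                current = None
        else:
            if current is None:
                current = {"off_a": pos_a, "off_b": pos_b,
                           "aligned": 0, "gap_bases": 0}
            if is_gap:
                current["gap_bases"] += n
            else:
                current["aligned"] += n
        # Assumed convention: I skips sequence A, D skips sequence B.
        if op != "I":
            pos_a += n
        if op != "D":
            pos_b += n
    if current is not None:
        blocks.append(current)
    return blocks
```

Each block can then be kept or dropped with any criterion you like, for example a minimum number of aligned columns or a maximum ratio of `gap_bases` to `aligned`. Note that a plain M does not distinguish matches from mismatches, so per-block identity would additionally need the sequences themselves.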

Ibrahim

Yujiaxin419 commented 4 years ago

Dear professor:

Thank you for your useful suggestions. I will consider your advice carefully.

Yujiaxin

vpc-ccg / sedef

How to filter the final.bed result? #20