thiagomaf / CSCpromoters

The goal of CSCpromoters is to extract, export and plot sequences of promoter regions from the genome fasta files.
MIT License

Problem with bad/duplicated loci entries #29

Open thiagomaf opened 2 months ago

thiagomaf commented 2 months ago

Deal with duplicated entries. The issue is illustrated in the README plot (loci chr4Hg0380431, chr4Hg0383761, or chr5Hg0547031).

Duplicated loci are locus IDs that appear more than once in the annotation tables or FASTA files. They are often quite short (about 200 bp) and occur in opposite directions (i.e. one forward and the other reverse). They are problematic for two reasons: (1) they introduce artifacts into the extracted promoter sequences (i.e. the upstream reference locus is false), and (2) they might indicate poor sequencing/mapping (i.e. the duplicated locus might be part of the upstream or downstream gene).
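To make the description concrete, here is a minimal base R sketch of how duplicated locus IDs could be spotted and summarised in an annotation table. The table, the toy IDs, and the column names (`locus`, `start`, `end`, `strand`) are assumptions for illustration only; they are not the package's actual schema.

```r
# Toy annotation table; IDs, coordinates, and column names are made up for illustration.
annotation <- data.frame(
  locus  = c("locA", "locB", "locB", "locC"),
  start  = c(100, 900, 950, 2000),
  end    = c(600, 1100, 1150, 2800),
  strand = c("+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

# Count how often each locus ID occurs and keep the IDs seen more than once.
id_counts <- table(annotation$locus)
dup_ids   <- names(id_counts)[id_counts > 1]

# Summarise each duplicated ID: number of copies, longest copy, strands involved.
dup_summary <- do.call(rbind, lapply(dup_ids, function(id) {
  rows <- annotation[annotation$locus == id, ]
  data.frame(
    locus     = id,
    copies    = nrow(rows),
    max_width = max(rows$end - rows$start + 1),
    strands   = paste(sort(unique(rows$strand)), collapse = "/"),
    stringsAsFactors = FALSE
  )
}))

print(dup_summary)
```

A summary like this would show at a glance whether the duplicated entries match the pattern described above (short, and present on both strands).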

Aleksandr has proposed filtering out those duplicated loci. This solution assumes the duplicated loci are technical artifacts and ignores them. This implies that the problematic sequences of the duplicated loci might be present in the output promoter sequences of their downstream locus. His proposal will be incorporated and implemented as a function, to be used ad hoc prior to the analysis.
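As a rough sketch of what such a pre-analysis filter might look like (this is not Aleksandr's actual code), the base R function below drops every row whose locus ID occurs more than once. The function name `filter_duplicated_loci`, the `id_col` argument, and the toy table are all hypothetical.

```r
# Toy annotation table; IDs, coordinates, and column names are assumptions.
annotation <- data.frame(
  locus  = c("locA", "locB", "locB", "locC", "locD", "locD"),
  start  = c(100, 900, 950, 2000, 3500, 3600),
  end    = c(600, 1100, 1150, 2800, 3700, 3800),
  strand = c("+", "+", "-", "-", "+", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove ALL copies of any locus ID that appears more than once.
filter_duplicated_loci <- function(annotation, id_col = "locus") {
  ids <- annotation[[id_col]]
  # duplicated() alone misses the first copy, so scan from both ends.
  is_dup <- duplicated(ids) | duplicated(ids, fromLast = TRUE)
  list(
    kept    = annotation[!is_dup, , drop = FALSE],  # loci passed on to the analysis
    removed = annotation[is_dup, , drop = FALSE]    # duplicated loci, kept for review
  )
}

result <- filter_duplicated_loci(annotation)
print(result$kept$locus)
print(result$removed$locus)
```

Returning the removed rows alongside the kept ones is a design choice of this sketch: it leaves a record of which loci were ignored, which matters because their sequences may still end up inside the promoter of the downstream locus.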

Aleksandr's solution should be taken as temporary, though. A more definitive solution would involve dealing with the duplicated loci on the fly. Perhaps this could be achieved by using Aleksandr's filtering solution after the query retrieval (e.g. in get_promoter_sequences.R before 127

thiagomaf commented 2 months ago