thiagomaf / CSCpromoters

The goal of CSCpromoters is to extract, export and plot sequences of promoter regions from genome FASTA files.
MIT License

Problem when querying locus ids - unsanitized locus ids #28

Open thiagomaf opened 1 month ago

thiagomaf commented 1 month ago

I noticed a while ago that a few entries are not returned despite being present in the genome file. E.g., querying "chrXHg123456" returns no promoters if the annotated locus_id value is "chrXHg123456 (GENE1)". This kind of bad annotation practice is rare, but it has happened to me at least once in the past.

The problem is in lines 121-125 of the get_promoter_sequences.serial() function.

Lines 121-123 were commented out and replaced by lines 124-125 in an attempt to fix this issue. The code broke (reported by Aleksandr). As a temporary fix, the change will be reverted. A more definitive solution will likely involve sanitizing the annotations after they are loaded.

The function get_promoter_sequences.parallel is unaffected, thanks to a lousy code maintainer (i.e. I forgot it existed).
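For reference, here is a minimal sketch of what "sanitizing annotations after they are loaded" could look like. It is written in Python purely as an illustration and is not the package's R code. The helper names (`sanitize_locus_id`, `sanitize_annotations`) and the record layout are hypothetical, and it assumes the real locus id is always the first whitespace-delimited token of the annotated value.

```python
# Hypothetical helpers, not part of CSCpromoters.
# Assumption: the real locus id is the first whitespace-delimited token and
# anything after it (e.g. " (GENE1)") is extra annotation to be dropped.

def sanitize_locus_id(raw_id: str) -> str:
    """Return the leading token of an annotated locus id."""
    tokens = raw_id.strip().split(maxsplit=1)
    return tokens[0] if tokens else ""


def sanitize_annotations(records):
    """Return copies of the annotation records with a cleaned 'locus_id'."""
    return [
        {**rec, "locus_id": sanitize_locus_id(rec["locus_id"])}
        for rec in records
    ]


# Made-up records that mimic the case described in this issue.
annotations = [
    {"locus_id": "chrXHg123456 (GENE1)", "start": 100, "end": 900},
    {"locus_id": "chrXHg654321", "start": 1500, "end": 2400},
]
clean = sanitize_annotations(annotations)
```

The originals are left untouched and only the copies are cleaned, so the raw annotation can still be reported back to the user if needed.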

thiagomaf commented 1 month ago

Aleksandr has proposed a solution that involves sanitizing the locus ids prior to the analysis/query of promoters. As I see it, this is a viable but undesirable solution, since it shifts more responsibility for the analyses onto the user. Ideally, we would implement an "on the fly" solution (i.e. dealing with the issue at runtime).

This would imply either (1) reimplementing the search function from GenomicFeatures, or (2) post-processing the results returned by the query. The latter might not be viable, since the GenomicFeatures function might not return anything. It would work, though, if GenomicFeatures accepts wildcards in the query values.
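To make option (1) more concrete, here is a rough sketch of matching done at query time. It is in Python for illustration only. It does not use or mimic the GenomicFeatures API, `find_loci` is a hypothetical name, and it assumes any extra annotation is separated from the locus id by whitespace.

```python
import re


def find_loci(query_ids, annotated_ids):
    """Map each query id to the annotated ids it matches, at query time.

    Hypothetical helper: an annotated id matches when it equals the query
    or starts with the query followed by whitespace and extra text.
    """
    hits = {}
    for query in query_ids:
        # re.escape keeps characters in the id from acting as regex syntax.
        pattern = re.compile(rf"^{re.escape(query)}(?:\s|$)")
        hits[query] = [a for a in annotated_ids if pattern.match(a)]
    return hits


# Made-up annotated ids, including one that only shares a prefix.
annotated = ["chrXHg123456 (GENE1)", "chrXHg1234567", "chrXHg654321"]
result = find_loci(["chrXHg123456", "chrXHg654321", "chrXHg000000"], annotated)
```

Requiring whitespace or end-of-string after the query avoids false hits on ids that merely share a prefix (e.g. "chrXHg1234567"). The matched original values could then be handed to the existing query unchanged, so the user would not have to clean anything up front.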