walaj / svaba

Structural variation and indel detection by local assembly
GNU General Public License v3.0
235 stars 45 forks source link

"Duplicate" kgID when doing annotation. #115

Open sxwcasd opened 2 years ago

sxwcasd commented 2 years ago

I am trying to running the annotation (svaba-annotate.R) on GENCODE db. However, the UCSC db records pulled by these 2 lines. https://github.com/walaj/svaba/blob/0f60e366c300bbefbba762bcc6d2b661bd2ae74a/R/svaba-annotate.R#L59-L60

Are having duplicates. example:

                   kgID         mRNA      geneSymbol   spID       refSeq chrom   txStart     txEnd strand
  1: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrX 155997695 156010817      +
  2: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrY  57184215  57197337      +

This is making sense to me, that the sex chromosomes have different position and share the some mRNA. But this will hit error at the following line: Error: !any(duplicated(genes$kgID)) is not TRUE

Maybe we have have a better validation check at here?

walaj commented 2 years ago

Hmm, that is an unusual situation with a gene listed on multiple chromosomes -- interesting catch. Happy to entertain a pull request if you have a better validation scheme.

On Thu, Jun 23, 2022 at 4:37 PM Linghao Song @.***> wrote:

I am trying to running the annotation (svaba-annotate.R) on GENCODE db. However, the UCSC db records pulled by these 2 lines.

https://github.com/walaj/svaba/blob/0f60e366c300bbefbba762bcc6d2b661bd2ae74a/R/svaba-annotate.R#L59-L60

Are having duplicates. example:

               kgID         mRNA      geneSymbol   spID       refSeq chrom   txStart     txEnd strand

1: ENST00000244174.11 NM_002186 IL9R Q01113 NM_002186 chrX 155997695 156010817 + 2: ENST00000244174.11 NM_002186 IL9R Q01113 NM_002186 chrY 57184215 57197337 +

This is making sense to me, that the sex chromosomes have different position and share the some mRNA. But this will hit error at the following line: Error: !any(duplicated(genes$kgID)) is not TRUE https://github.com/walaj/svaba/blob/0f60e366c300bbefbba762bcc6d2b661bd2ae74a/R/svaba-annotate.R#L67

Maybe we have have a better validation check at here?

— Reply to this email directly, view it on GitHub https://github.com/walaj/svaba/issues/115, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUZ7CG5OHOZAOFME3GMGODVQTDI7ANCNFSM5ZVTSDKQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

sxwcasd commented 2 years ago

Because we need both gencode and exonframe information, we loaded wgEncodeGencodeCompV36 table instead of knownGene or kgXref. And validate by start and end sites. I m not sure if that make sense to your original design (because we don't need to consider refseq id mapping). But it seems accomplished our goals and get around this issue.