jodyphelan / tbdb

Standard database for the TBProfiler tool
GNU Lesser General Public License v3.0

Custom database questions #45

Open mbhall88 opened 1 year ago

mbhall88 commented 1 year ago

I'm having some issues trying to create a custom database.

My understanding from the documentation is that I clone this repo, and then replace/change the tbdb.csv file to have the mutations I want, then I run parse_db.py in the main directory?

It seems a file is missing, and I can't find it documented anywhere:

$ python parse_db.py -c tbdb.csv --custom
Traceback (most recent call last):
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 281, in <module>
    args.func(args)
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 202, in main
    gene_info = load_gene_info("genes.txt")
  File "/Users/michaelhall/Projects/drprg/paper/tmp/tbdb/parse_db.py", line 187, in load_gene_info
    for l in open(filename):
FileNotFoundError: [Errno 2] No such file or directory: 'genes.txt'

I then instead tried running the following from the tbdb main directory

$ tb-profiler create_db --custom --include_original_mutation

This completes successfully, but I have a further issue with its output.

As per the docs, the mutations must follow HGVS nomenclature. But it seems tb-profiler only accepts a subset of this nomenclature.

For example, I have the mutation c.196_198delinsTAG, which describes an MNP at position 196 TCG>TAG. Looking at the tbdb.conversion.log this (incorrectly) gets converted as

Converted pncA c.196_198delinsTAG to c.196_198delTCG

Are you able to clarify (here and in the docs) what subset you support?
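For anyone else building a custom database, here is a minimal sketch of how such conversions could be spotted automatically. It assumes every line of interest in tbdb.conversion.log has the form shown above ("Converted GENE BEFORE to AFTER"). The helper names `variant_kind` and `suspicious_conversions` are made up for illustration and are not part of tb-profiler, and the second sample line is invented.

```python
import re

# Assumed log line format, taken from the single example in this thread.
LOG_RE = re.compile(r"^Converted (\S+) (\S+) to (\S+)$")


def variant_kind(hgvs: str) -> str:
    """Roughly classify an HGVS string by the edit keywords it contains."""
    has_del = "del" in hgvs
    has_ins = "ins" in hgvs
    if has_del and has_ins:
        return "delins"  # covers both "delinsTAG" and "delTCGinsTAG" spellings
    if has_del:
        return "del"
    if has_ins:
        return "ins"
    if "dup" in hgvs:
        return "dup"
    if ">" in hgvs:
        return "substitution"
    return "other"


def suspicious_conversions(log_lines):
    """Return (gene, before, after) tuples where the variant class changed."""
    flagged = []
    for line in log_lines:
        match = LOG_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like a conversion record
        gene, before, after = match.groups()
        if variant_kind(before) != variant_kind(after):
            flagged.append((gene, before, after))
    return flagged


sample_log = [
    "Converted pncA c.196_198delinsTAG to c.196_198delTCG",  # from this issue
    "Converted geneX c.643dup to c.643dupC",  # invented line for illustration
]
for gene, before, after in suspicious_conversions(sample_log):
    print(f"{gene}: {before} became {after}")
```

Only the pncA line is reported, because a delins was turned into a plain deletion, while the duplication keeps its class. In practice the list would come from something like `open("tbdb.conversion.log")` rather than `sample_log`.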

mbhall88 commented 1 year ago

I've also noticed you don't accept duplications in the recommended format? That is, c.643dup must specify the duplicated base at the end, e.g. c.643dupC.

jodyphelan commented 1 year ago

Hi @mbhall88 ,

Sorry, I need to update the documentation. You are right to use tb-profiler create_db instead.

As per the docs, the mutations must follow HGVS nomenclature. But it seems tb-profiler only accepts a subset of this nomenclature. For example, I have the mutation c.196_198delinsTAG, which describes an MNP at position 196 TCG>TAG. Looking at the tbdb.conversion.log this (incorrectly) gets converted.

Yes, at the moment it only accepts a subset. The pipeline uses snpEff to annotate variants in new samples, and snpEff represents each variant in only one way (e.g. c.643dupC instead of c.643dup). To simplify the variant lookup step, the create_db function tries to standardise all variants to the snpEff format using regex, but currently I've only added support for the variant types that are in tbdb.csv. Over the next few days I'll try to update the docs and look into adding compatibility for more types, such as the one you listed.
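To illustrate the kind of regex standardisation described above, here is a minimal sketch (not the actual create_db code) that expands a duplication written without its bases into the explicit form. The names `DUP_RE` and `expand_dup` and the toy sequence are hypothetical, the coding-strand sequence of the gene is assumed to be available, and the output format is assumed from the c.643dupC example in this thread.

```python
import re

# Hypothetical pattern: c.<start>dup or c.<start>_<end>dup, optionally with bases.
DUP_RE = re.compile(r"^c\.(\d+)(?:_(\d+))?dup([ACGT]*)$")


def expand_dup(mutation: str, cds: str) -> str:
    """Append the duplicated bases to a coding 'dup' that omits them.

    `cds` is assumed to be the coding-strand sequence of the gene.
    Anything that is not a simple coding duplication is returned unchanged.
    """
    match = DUP_RE.match(mutation)
    if match is None:
        return mutation
    start = int(match.group(1))
    end = int(match.group(2) or start)
    if not 1 <= start <= end <= len(cds):
        raise ValueError(f"{mutation}: position outside the coding sequence")
    bases = cds[start - 1:end]  # HGVS positions are 1-based and inclusive
    given = match.group(3)
    if given and given != bases:
        raise ValueError(f"{mutation}: bases do not match the reference")
    if start == end:
        return f"c.{start}dup{bases}"
    return f"c.{start}_{end}dup{bases}"


toy_cds = "ATGCGTACC"  # made-up sequence, for illustration only
print(expand_dup("c.4dup", toy_cds))    # c.4dupC
print(expand_dup("c.2_3dup", toy_cds))  # c.2_3dupTG
```

Other variant classes, such as the delins case raised above, would each need their own rule.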

Thanks for raising the issue!

mbhall88 commented 1 year ago

Thanks for the clarification. Trying to support all of HGVS would likely be difficult, and would likely require developing a library. I just noticed https://github.com/biocommons/hgvs though! I haven't used it before, but looks like it might make your life a little easier potentially?

Anyways, I got a custom db working and just thought this issue might be helpful just for some docs changes.

Thanks for the quick response.

jodyphelan commented 1 year ago

Oh I hadn't seen that before, I'll check it out thanks! And, I'll have a go at updating the docs asap.