jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type from WGS data
GNU General Public License v3.0

Documentation for VCF as input #394

Open mbhall88 opened 2 weeks ago

mbhall88 commented 2 weeks ago

Hi Jody,

I have just been running tbprofiler with some samples using VCF as the input (it is ONT data I have variant-called with Clair3). Forgive me if I have missed it somewhere but there doesn't seem to be any documentation about what is expected of the VCF?

For future me (and maybe others): the VCF needs to be indexable, i.e., a BGZIP-compressed VCF (.vcf.gz) or a BCF. The other thing, which I found a little more sinister, was that the CHROM names must be Chromosome. I had them as NC_000962.3 and tbprofiler ran without any errors, but I got essentially no resistance predictions. When I changed the CHROM name in the VCF, I got the expected predictions.

My hacky/fast way of making this change was

bcftools view in.vcf.gz | sed 's/NC_000962.3/Chromosome/g' | bcftools view -o out.bcf

and then run tbprofiler with -v out.bcf.

I guess a more robust solution would be to use BCFtools

echo -e 'NC_000962.3\tChromosome' | bcftools annotate --rename-chrs - -o out.bcf in.vcf.gz
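The same rename can be sketched with an explicit map file, an explicit BCF output type, and an index step, so the result meets the "indexable" requirement mentioned above. This is only a sketch: the function names write_chrom_map and rename_chrom and the file names in.vcf.gz, chr_map.txt, and out.bcf are placeholders I have made up, and it assumes bcftools is on PATH.

```shell
# Write a two-column, tab-separated map file: old contig name, new contig name.
# printf is used instead of `echo -e`, whose behaviour varies between shells.
write_chrom_map() {
    printf '%s\t%s\n' "$1" "$2" > "$3"
}

# Rename contigs according to the map, write compressed BCF (-O b),
# then index the result. Requires bcftools on PATH.
rename_chrom() {
    in_vcf="$1"
    map="$2"
    out_bcf="$3"
    bcftools annotate --rename-chrs "$map" -O b -o "$out_bcf" "$in_vcf" &&
        bcftools index "$out_bcf"
}

# Example usage (all file names are placeholders):
#   write_chrom_map NC_000962.3 Chromosome chr_map.txt
#   rename_chrom in.vcf.gz chr_map.txt out.bcf
```

Unlike the sed pipeline, the map file only touches contig names, so an accidental match elsewhere in the header or INFO fields is not rewritten.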

Anyway, maybe some of these examples could be added to the docs? I know I would find it useful, so maybe others would too?

jodyphelan commented 2 weeks ago

Hi Michael

Apologies for the awful documentation, I really need to invest some time into improving it! I will try to put together a section on what it looks for in a VCF.

Yes, the default database uses 'Chromosome' as the chromosome name. If you would like to use your VCFs with a different chromosome name, then I would recommend passing --match_ref </path/to/your/reference.fasta> to update_db or create_db, which will use whatever name is in your own fasta file. Again, as you pointed out, this isn't very clear, so I'll try to make a little decision tree figure on data inputs and recommended settings.
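To see which name a reference FASTA would contribute in that case, the sequence names can be read straight from its header lines. A minimal sketch, assuming a plain (uncompressed) FASTA; the function name fasta_names and the file name reference.fasta are hypothetical:

```shell
# Print the sequence name from each FASTA header line:
# the text after '>' up to the first whitespace character.
fasta_names() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1"
}

# Example usage (reference.fasta is a placeholder):
#   fasta_names reference.fasta
```

The names printed here are the ones to compare against the CHROM column of the VCF.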

The fact it doesn't complain when you feed it a VCF with different chrom names is pretty critical! I'll put in a fix for that and make a new release asap!
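On the user side, a guard along these lines can catch the mismatch before tbprofiler is run. It is a sketch, not what tbprofiler itself does: the function names vcf_contigs and check_contigs are made up for illustration, the input is uncompressed VCF text on stdin, and in.vcf.gz is a placeholder.

```shell
# List the distinct CHROM values in uncompressed VCF text read from stdin.
# Header lines (starting with '#') are skipped.
vcf_contigs() {
    awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }'
}

# Succeed only if every record sits on the single expected contig
# (default: Chromosome). A VCF with no records, or with more than one
# contig, also fails the check.
check_contigs() {
    expected="${1:-Chromosome}"
    found="$(vcf_contigs)"
    [ "$found" = "$expected" ]
}

# Example usage (needs bcftools; in.vcf.gz is a placeholder):
#   bcftools view in.vcf.gz | check_contigs Chromosome \
#       || echo "CHROM names do not match the database" >&2
```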

And I didn't know about the --rename-chrs option in bcftools. I'm using my own hacky script internally, but this is far more elegant!

mbhall88 commented 2 weeks ago

No worries. It's hard to keep docs updated as a tool evolves.

Personally, I think just renaming the chrom in the VCF as I outlined above is probably an easier route than updating the DB. It's also totally fine to expect users to do this, and I guess I kind of created this issue to show an example of how I achieved it. Selfishly for future me, but hopefully others find it useful. Also, feel free to use it in the docs if you think it is helpful.

Thanks again for keeping TBProfiler updated and evolving.