tseemann / snippy

:scissors: :zap: Rapid haploid variant calling and core genome alignment
GNU General Public License v2.0
462 stars 113 forks source link

Masking issue #265

Closed pgcudahy closed 4 years ago

pgcudahy commented 5 years ago

I'm have a feeling that this is a silly formatting issue but I keep getting the error from snippy-core ERROR: BED [Chrom ChromStart ChromEnd locus tag Comment] has CHR that is not in the reference

My command is snippy-core --prefix core-SNPs --mask /data/pilot_strain_heterogeneity/data/interim/snippy/Locus_to_exclude_Mtb_new.bed --ref /data/pilot_strain_heterogeneity/data/external/NC_000962/NC_000962.3.fasta /data/pilot_strain_heterogeneity/data/interim/snippy/PC12_S41 /data/pilot_strain_heterogeneity/data/interim/snippy/PC13_S20

My BED file looks like this

head /data/pilot_strain_heterogeneity/data/interim/snippy/Locus_to_exclude_Mtb_new.bed
Chrom   ChromStart      ChromEnd        locus tag       Comment
gi|448814763|ref|NC_000962.3|   23182   23269   IG18_Rv0018c-Rv0019c
gi|448814763|ref|NC_000962.3|   33582   33794   Rv0031  remnant of A transposase
gi|448814763|ref|NC_000962.3|   80194   80623   IG71_Rv0071-Rv0072
gi|448814763|ref|NC_000962.3|   103710  104663  Rv0094c 50bp_duplicated
gi|448814763|ref|NC_000962.3|   104663  104805  IG_Rv0094c-Rv0095c
gi|448814763|ref|NC_000962.3|   104805  105215  Rv0095c 50bp_duplicated
gi|448814763|ref|NC_000962.3|   105215  105324  IG_Rv0095c-Rv0096
gi|448814763|ref|NC_000962.3|   105324  106715  Rv0096  PPE family protein
gi|448814763|ref|NC_000962.3|   131382  132872  Rv0109  PE-PGRS family protein

And the reference is

head /data/pilot_strain_heterogeneity/data/external/NC_000962/NC_000962.3.fasta
>gi|448814763|ref|NC_000962.3| Mycobacterium tuberculosis H37Rv, complete genome
TTGACCGATGACCCCGGTTCAGGCTTCACCACAGTGTGGAACGCGGTCGTCTCCGAACTTAACGGCGACC
CTAAGGTTGACGACGGACCCAGCAGTGATGCTAATCTCAGCGCTCCGCTGACCCCTCAGCAAAGGGCTTG
GCTCAATCTCGTCCAGCCATTGACCATCGTCGAGGGGTTTGCTCTGTTATCCGTGCCGAGCAGCTTTGTC
CAAAACGAAATCGAGCGCCATCTGCGGGCCCCGATTACCGACGCTCTCAGCCGCCGACTCGGACATCAGA
TCCAACTCGGGGTCCGCATCGCTCCGCCGGCGACCGACGAAGCCGACGACACTACCGTGCCGCCTTCCGA
AAATCCTGCTACCACATCGCCAGACACCACAACCGACAACGACGAGATTGATGACAGCGCTGCGGCACGG
GGCGATAACCAGCACAGTTGGCCAAGTTACTTCACCGAGCGCCCGCACAATACCGATTCCGCTACCGCTG
GCGTAACCAGCCTTAACCGTCGCTACACCTTTGATACGTTCGTTATCGGCGCCTCCAACCGGTTCGCGCA
CGCCGCCGCCTTGGCGATCGCAGAAGCACCCGCCCGCGCTTACAACCCCCTGTTCATCTGGGGCGAGTCC

for comparison, the bed files that snippy has generated from the same data look like this

head PC12_S41/snps.bed
gi|448814763|ref|NC_000962.3|   1848    1849
gi|448814763|ref|NC_000962.3|   1976    1977
gi|448814763|ref|NC_000962.3|   4012    4013
gi|448814763|ref|NC_000962.3|   7361    7362
gi|448814763|ref|NC_000962.3|   7584    7585
gi|448814763|ref|NC_000962.3|   9303    9304
gi|448814763|ref|NC_000962.3|   11819   11820
gi|448814763|ref|NC_000962.3|   11878   11879
gi|448814763|ref|NC_000962.3|   14784   14785
gi|448814763|ref|NC_000962.3|   14860   14861

I'm using snippy-core 4.3.2 and have tried using sed to change the bed file to all kinds of different chrom fields but no success. I've also tried removing the header and the metadata columns, but then get ERROR: BED [gi|448814763|ref|NC_000962.3| 2356729 2358206] has CHR that is not in the reference

tseemann commented 5 years ago

The first line of your BED file is not valid. BED does not support a "header" row. Also BED is tab-separated; are you sure you aren't using spaces accidently?

peflanag commented 4 years ago

Hi,

I thought I would tag on here because I want to make a BED file to mask repetitive regions in my M tuberculosis SNP analsysis. My question is how do I make a BED file and can someone, possibly @pgcudahy tell me what is the general consensus region of genes to exclude? I'm new to the TB world. I was using MTBseq and tried Snippy today and their is quite a difference between both forms of analysis.

Cheers.

tseemann commented 4 years ago

For M.tb we use the regions from this website: https://gph.niid.go.jp/tgs-tb/

I have now added a BED version of that XLSX file to Snippy: https://raw.githubusercontent.com/tseemann/snippy/master/etc/Mtb_NC_000962.3_mask.bed

It's mentioned in the docs here now: https://github.com/tseemann/snippy#options-1

Just do snippy-core --mask Mtb_NC_000962.3_mask.bed ....

But make sure you use the correct reference version (3), and that it's FASTA ID is NC_000962.3 to match the name in column 1 of the BED file.

https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3

peflanag commented 4 years ago

Thats perfect cheers, I am running version 4.4.3

Thanks so much

tseemann commented 4 years ago

You will need to manually get the bed file from github - i have not released a new version yet

peflanag commented 4 years ago

Oh ok. This is a stupid question then, the link above with the BED file info, do I copy it into excel and save as an xlxs file? And then change the .xlxs part to .BED before putting it in the snippy/etc/ path?

tseemann commented 4 years ago
wget https://raw.githubusercontent.com/tseemann/snippy/master/etc/Mtb_NC_000962.3_mask.bed

You don't need to put it into etc/ folder - you have to put it wherever you need it, and give it to the --mask option in snippy-core.

peflanag commented 4 years ago

Thanks so much, I really appreciate your help!