mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
42 stars 7 forks source link

How exons should be specified in the reference FASTA file? #33

Open anya-pavlova opened 2 months ago

anya-pavlova commented 2 months ago

Hi,

I tried to launch T1K on my data, but got segfault with the following trace

#0  0x000055555556cd03 in SeqSet::SetSeqExonInfo (exons=std::vector of length 26, capacity 32 = {...}, seqIdx=14, this=0x7fffffffbc10) at /data1/biosoft/T1K/SeqSet.hpp:669
#1  SeqSet::InputRefSeq (this=this@entry=0x7fffffffbc10, id=id@entry=0x5555647182c0 "chr15", read=read@entry=0x7fffee8f4010 'N' <repeats 200 times>..., weight=weight@entry=1,     initExonInfo=initExonInfo@entry=true,     comment=comment@entry=0x55558dff59d0 " AC:CM000677.2  gi:568336009  LN:101991189  rl:Chromosome  M5:f036bd11158407596ca6bf3581454706  AS:GRCh38")    at /data1/biosoft/T1K/SeqSet.hpp:979
#2  0x0000555555571806 in Genotyper::InitRefSet (this=this@entry=0x7fffffffb9d0, 
    filename=filename@entry=0x5555555950d0 "/data1/mnt/LMBALL/RefData_VMP01/references/38/GDC/GRCh38.d1.vd1.fa") at /data1/biosoft/T1K/Genotyper.hpp:700
#3  0x00005555555576d1 in main (argc=14, argv=0x7fffffffdfe8) at Genotyper.cpp:331

As I understand from the source code, T1K expects that a reference FASTA file will contain exons in the metadata lines. But metadata lines in my FASTA files look like that

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

I think that what is causing segfault

What is an expected format for metadata lines in FASTA files for T1K?

Also I am very curious about this loop. Why does it start with 1 and not 0? And it seems that it also assumes that size is odd (otherwise there will be read out of bounds). Is that something expected?

Thank you in advance

mourisl commented 2 months ago

How did you create the fasta file?

Using the reference file for the WGS/WES data as an example, the header in the fasta file should be something like:

>HLA-A*01:01:01:01 8 50 122 253 522 764 1039 1441 1716 1819 1935 2337 2369 2512 2559 2729 2733

After the allele, the first number is the number of exons, and then pairs of number on the fasta sequences corresponds to one exon ([start,end], 0-based coordinate). That's also the reason the loop starts at 1.