ogotoh / spaln

Genome mapping and spliced alignment of cDNA or amino acid sequences
GNU General Public License v2.0

Spaln crashed when mapping mouse proteins to the human genome #57

Closed: lh3 closed this issue 2 years ago

lh3 commented 2 years ago

I am using version 2.4.12, checked out on August 14, 2022. I compiled from source code on CentOS 7. Here are the command lines:

src/spaln -Whs38.kkp -KP hs38.fa  # this is ok
src/spaln -Q7 -t16 -O0 -dhs38.bkp mm39.canon.fa > out.gff  # this is ok
src/spaln -Q7 -t16 -O0 -LS -dhs38.bkp mm39.canon.fa   # crashed
src/spaln -Q7 -t16 -O0 -Thomosapi -dhs38.bkp mm39.canon.fa  # crashed
src/spaln -Q7 -t16 -O0 -yS -Thomosapi -dhs38.bkp mm39.canon.fa  # crashed

There are about 20k "canonical" mouse proteins in mm39.canon.fa. All three crashes happened after aligning ~10k proteins. The system error message looks like:

*** Error in `src/spaln': double free or corruption (!prev): 0x00002aab9c02ec00 ***

I have put the input sequences at:

ftp://ftp.dfci.harvard.edu/pub/hli/tmp/spaln/

I am not sure if the crash can be reproduced on your end, though.

ogotoh commented 2 years ago

Sorry, but I cannot access the FTP site you mentioned. It would be most convenient for me if you could send me, by any means, the particular amino acid sequence(s) that lead to the segmentation fault.

Osamu,

lh3 commented 2 years ago

I am using reference genome from this link:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

The protein sequence that causes the problem is:

>ENSMUSP00000074009.7
MWRADRWAPLLLFLLQSALGRPRLAPPRNVTLFSQNFTVYLTWLPGLGSPPNVTYFVTYQSYIKTGWRPVEHCAGIKALV
CPLMCLKKLNLYSKFKGRVQAASAHGRSPRVESRYLEYLFDVELAPPTLVLTQMEKILRVNATYQLPPCMPSLELKYQVE
FWKEGLGSKTLFPDTPYGQPVQIPLQQGASRRHCLSARTVYTLIDIKYSQFSEPSCIFLEAPGDKRAVLAMPSLLLLLIA
AVAAGVAWKIMKGNPWFQGVKTPRALDFSEYRYPVATFQPSGPEFSDDLILCPQKELTIRNRPAPQVRNPATLQAGPERD
STEDEDEDTDYDDDGDSVQPYLERPLFISEKPRVMEHSETDESGVDSGGPWTSPVGSDGSSAWDSSDRSWSSTGDSSYKD
EVGSSSCLDRKEPDQAPCGDWLQEALPCLEFSEDLGTVEEPLKDGLSGWRISGSLSSKRDLAPVEPPVSLQTLTFCWVNN
PEGEEEQEDEEEEEEEEEEEDWESEPKGSNAGCWGTSSVQRTEVRGRMLGDYLVR

I get a crash on my end with

src/spaln -Q7 -t1 -O0 -Thomosapi -dhs38.bkp protein.fa

I don't see the crash without -T.
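
For readers who want to isolate a crashing query from a large FASTA file in a similar way, here is a minimal Python sketch. It is not part of spaln, and the thread does not say how the sequence above was actually found. The helper names (read_fasta, crashes, bisect_crash) are hypothetical, and the bisection assumes that a single query triggers the crash on its own, which may not hold if a crash depends on the mix of queries.

```python
import subprocess
import tempfile
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def crashes(records, cmd_prefix):
    """Write records to a temporary FASTA, run cmd_prefix on it, and
    return True if the process was killed by a signal (e.g. SIGSEGV)."""
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        for header, seq in records:
            tmp.write(f">{header}\n{seq}\n")
    try:
        proc = subprocess.run(cmd_prefix + [tmp.name],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        # subprocess reports death by signal as a negative return code.
        return proc.returncode < 0
    finally:
        Path(tmp.name).unlink()


def bisect_crash(records, fails):
    """Halve the record list until one record remains.
    Assumes the full list fails and one record alone is responsible."""
    records = list(records)
    while len(records) > 1:
        half = records[: len(records) // 2]
        records = half if fails(half) else records[len(records) // 2:]
    return records
```

With the command from this comment, the predicate could be built as `lambda rs: crashes(rs, ["src/spaln", "-Q7", "-t1", "-O0", "-Thomosapi", "-dhs38.bkp"])`. A build instrumented with a sanitizer typically exits with a positive status instead of dying from a signal, so the check in `crashes` would need adjusting in that case.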

ogotoh commented 2 years ago

Dear Heng,

Thank you very much for sending me the example that caused the segmentation fault. I have also reproduced the problem on my side. I will try to fix it as soon as possible.

By the way, no segmentation fault occurred with this sequence when I used the -yX2 option, which is tuned for remote homologs. However, the result showed that spaln failed to map the query correctly onto the genome, so the resulting alignment was meaningless. The incorrect mapping is probably due to non-specific matches to repetitive elements. This suggests a direction in which spaln should be improved: it should accept soft-masked sequences, using the masked sequence in the mapping phase and the unmasked sequence in the alignment phase. I want to incorporate this feature in a future release.

Osamu,

lh3 commented 2 years ago

Thank you for maintaining spaln all these years. It is a great tool. I dug a little further into this protein, ENSMUSP00000074009.7. It is Ifnlr1 according to the Ensembl annotation. Here is another possible alignment of this protein on GRCh38:

chr1 miniprot mRNA       24157130 24187248 1574 - . ID=MP000001;Identity=0.5933;Positive=0.7015;Target=ENSMUSP00000074009.7 1 535
chr1 miniprot CDS        24187191 24187248 0    - 0 Parent=MP000001;Target=ENSMUSP00000074009.7 1 19
chr1 miniprot CDS        24180731 24180854 0    - 2 Parent=MP000001;Target=ENSMUSP00000074009.7 20 60
chr1 miniprot CDS        24169417 24169598 0    - 1 Parent=MP000001;Target=ENSMUSP00000074009.7 61 121
chr1 miniprot CDS        24161542 24161684 0    - 2 Parent=MP000001;Target=ENSMUSP00000074009.7 122 169
chr1 miniprot CDS        24159474 24159633 0    - 0 Parent=MP000001;Target=ENSMUSP00000074009.7 170 222
chr1 miniprot CDS        24159052 24159182 0    - 2 Parent=MP000001;Target=ENSMUSP00000074009.7 223 266
chr1 miniprot CDS        24157133 24157891 0    - 0 Parent=MP000001;Target=ENSMUSP00000074009.7 267 535
chr1 miniprot stop_codon 24157130 24157132 0    - 0 Parent=MP000001

The identity is below 60%, which is uncommon for human-mouse orthologs. Nonetheless, this region contains the human IFNLR1 gene. The above might be a real alignment (EDIT: I see a non-canonical splice site. Not sure if that is correct).

ogotoh commented 2 years ago

Dear Heng,

I have just uploaded the new version, spaln2.4.13, in which I replaced several files with revised ones. The new version runs without trouble on the sequence you provided and produces a seemingly correct alignment. In this regard, I was wrong to anticipate that the mapping phase might be the problem. I also tried to map and align all mouse RefSeq protein sequences with NP_* accessions (excluding predicted ones) onto the human genome. Although I have not yet evaluated the results, at least I encountered no crash during execution.

I would like to thank you again for your valuable comments, and I am very glad to have the opportunity to communicate with you.

Osamu,

lh3 commented 2 years ago

Dear Osamu,

Thank you for the fast response. I got another crash and was trying to identify the sequence. I think the following zebrafish protein is the culprit:

>ENSDARP00000108510.1
MIYLILLSAFSGAVVTLLLQLLLLYRRSPEPVARTVQYVKVVPDPALKDYFSSQQADSAPQQPDSPSPVSKQPEAASPKQ
QETPVPGSSPKQQPSSPPPPSLGDPQHSSKAETCDFLNAIILFLFRELRDTPVVRHWITKKIKVEFEELLQTKTAGRLLE
GLSLRDVSLGNSVPVFKTARLMKPVAVNEDNMPEELNFEVDLEYNGGFHLAIDVDLVFGKSAYLFVKMTRVAGRLKLQFT
RMPFTHWSFSFLEDPLIDFEVKSQFEGRPLPQLTSIIVNQLKRVIKRKHTLPNYKIRYKPFFPFQVQPPLMSSCDLDISI
RDTLLVEGRLRVTLVECSRLFILGSYERETYVHCTFELSSDEWREKTRSSIKETEVIKGPSGSVGMTFRHVPASDGDTVH
VSIETVTPNSPAALADLQRGDRLIAIGGVKVTSSVQVPKLLKQAGERVIVLYERPVRHHVPTGGLGMLQETLGPMEEPSY
LPQPGGYEEDPAPITTMDISENKDNDSEFEELNVESKTAAAPVTIDTKEDFLLSVNQSPKKTVANLAKPLGSISPILNRR
LNLQSPLKTQPKESPKPPTLKNAEPSEQPQRPTVPPPPPPARPPVPPRPHIKVTSASSEAQSLVEGNEPTVEKSPEKTQP
NTGNGEKTVEKIPVKPPEPKPVSKHPEPTEDILNIPATNKQDSAKDKISESSSNTRDSVDEQGLWESSETMYRNRTARWN
KASVIFEVESNHKFLNVALWCKNPFKLGSLLCLGHVSLRLEHLALECISTSSAEYQSTFRLCAPEPRASVSRTALRSLST
HKGFNEKLCYGDVTLNFTYLADGESDLSSGLTERERKGSLQEEDLKDREKEREQVLMVTRDEPIYSGMQIGEMRHNFQDT
QFQNPTYCEYCKKKVWTKAASQCMICSYVCHKKCQEKCLLEHPYCVAASDRRGADPEAKSTINRATTGLTRHIINTSSRL
LNLRQVPKARLAEQVADMGSGVVEPSPKHTPNTSDNESSDTETYTGASPSKQPAGSSGSKLVRKEGGLDDSVFIAVKEIG
RDLYRGLPTDERSQKLELMLDKLQQEIDQELEHNNSLSTEERDTIDSRRKTLITAALAKSGERLQALTLLMIHYRAGIED
LESVESTSPSEQHGFPKAKSEGLEEALMGTEVYDSDMCSPVDVQMLDEITEEQICVEALP

On my end, if I run spaln on this sequence alone, I do not see a crash. However, if I mix it with a few other sequences, I see a segmentation fault. AddressSanitizer reported the following (I modified the Makefile by adding -fsanitize=address -g and removing -O3):

hli@scorpius spaln-1$ src/spaln -Q7 -t1 -O0 -Thomosapi -dhs38.bkp 0-5.fa
=================================================================
==2737==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61700000005c at pc 0x000000585a4d bp 0x7fe94ab91350 sp 0x7fe94ab91348
WRITE of size 4 at 0x61700000005c thread T2
    #0 0x585a4c in Aln2h1::initH_ng(RVPD**, WINDOW const&, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:140
    #1 0x58a80b in Aln2h1::forwardH_ng(int*, WINDOW const&, bool, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:305
    #2 0x5ae4ec in Aln2h1::trcbkalignH_ng(WINDOW const&, bool, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:1964
    #3 0x5af578 in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2034
    #4 0x5affd0 in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2087
    #5 0x5c3611 in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2973
    #6 0x5c672f in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3118
    #7 0x5c26fb in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2947
    #8 0x5c672f in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3118
    #9 0x5c26fb in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2947
    #10 0x5c5edf in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3096
    #11 0x5c7035 in Aln2h1::globalH_ng(int*, WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3139
    #12 0x5c7739 in alignH_ng(Seq const**, PwdB const*, int*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3166
    #13 0x4cd0da in spalign2 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:647
    #14 0x4d00d3 in blkaln /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:829
    #15 0x4d3c27 in quick4 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1031
    #16 0x4d48d9 in spaln_job /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1072
    #17 0x4d7064 in worker_func /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1289
    #18 0x7fea2a8c1ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)
    #19 0x7fea29eceb0c in clone (/lib64/libc.so.6+0xfeb0c)

0x61700000005c is located 36 bytes to the left of 672-byte region [0x617000000080,0x617000000320)
allocated by thread T2 here:
    #0 0x4a465f in malloc (/hlilab/hli/miniprot/spaln-1/src/spaln+0x4a465f)
    #1 0x6a9e64 in operator new(unsigned long) (/hlilab/hli/miniprot/spaln-1/src/spaln+0x6a9e64)
    #2 0x5ae4ec in Aln2h1::trcbkalignH_ng(WINDOW const&, bool, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:1964
    #3 0x5af578 in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2034
    #4 0x5affd0 in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2087
    #5 0x5c3611 in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2973
    #6 0x5c672f in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3118
    #7 0x5c26fb in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2947
    #8 0x5c672f in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3118
    #9 0x5c26fb in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2947
    #10 0x5c5edf in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3096
    #11 0x5c7035 in Aln2h1::globalH_ng(int*, WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3139
    #12 0x5c7739 in alignH_ng(Seq const**, PwdB const*, int*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3166
    #13 0x4cd0da in spalign2 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:647
    #14 0x4d00d3 in blkaln /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:829
    #15 0x4d3c27 in quick4 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1031
    #16 0x4d48d9 in spaln_job /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1072
    #17 0x4d7064 in worker_func /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1289
    #18 0x7fea2a8c1ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)

Thread T2 created by T0 here:
    #0 0x450322 in pthread_create (/hlilab/hli/miniprot/spaln-1/src/spaln+0x450322)
    #1 0x4d820b in MasterWorker /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1341
    #2 0x4da18f in main /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1492
    #3 0x7fea29df2554 in __libc_start_main (/lib64/libc.so.6+0x22554)

Hope this helps.
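
When comparing several reports like the one above, it can help to reduce each to its innermost source-level frames. The following Python sketch is an illustration only and not part of spaln or AddressSanitizer. The name top_frames is hypothetical, and the regular expression assumes the frame layout seen in this report ("#N 0xADDR in FUNCTION /path/file:LINE"); frames without a source path, such as those inside libc, are skipped.

```python
import os
import re

# Matches frames that carry a source file and line, e.g.
#   #0 0x585a4c in Aln2h1::initH_ng(RVPD**, ...) /path/src/fwd2h1.cc:140
FRAME_RE = re.compile(
    r"^\s*#(?P<idx>\d+)\s+0x[0-9a-f]+\s+in\s+(?P<func>.+?)\s+"
    r"(?P<file>/\S+?):(?P<line>\d+)\s*$"
)


def top_frames(report, limit=3):
    """Return up to `limit` (function, file, line) tuples from the first
    stack trace in a sanitizer report."""
    frames = []
    for line in report.splitlines():
        m = FRAME_RE.match(line)
        if m:
            name = m["func"].split("(")[0]  # drop the argument list
            frames.append((name, os.path.basename(m["file"]), int(m["line"])))
            if len(frames) == limit:
                break
        elif frames and not line.strip():
            break  # a blank line ends the first stack
    return frames
```

Applied to the report above with `limit=2`, this picks out `Aln2h1::initH_ng` at fwd2h1.cc:140 and `Aln2h1::forwardH_ng` at fwd2h1.cc:305.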

ogotoh commented 2 years ago

Dear Heng,

As you suggested, ENSDARP00000108510.1 was indeed responsible for the crash. Thank you again for your help. Additionally, I found another bug that can cause similar trouble. I have just uploaded a revised version of spaln (spaln2.4.13a). Using all Ensembl zebrafish proteins as queries against the human genome, this version finished successfully with or without the -yX2 option. Although this does not guarantee the correctness of the implementation, the robustness of the program has certainly improved. I welcome your further comments and suggestions.

Osamu,

lh3 commented 2 years ago

Dear Osamu,

Thanks a lot for the update. Spaln is indeed more robust. It now produces alignments under several settings that previously crashed. However, I found a new case that causes a segmentation fault:

>ENSDARP00000141901.2
TEKLLLKRLSSTIIKMAFIKEETEDLKIEQVFTLKREDHEEQTDLTLLKEEIQELNDVKEEEDPKAQNTPQKHFKRHYCG
RGFTEKRNLTVHSRVHTGETRFSCKKCGESFNKKDLFEKHKEIHLAVICRHCGRQFTQKYIKTHMRIHTGERPFRCGQCG
KSFAQRSTLDTHVITHTGERPYACSHCGNGFTTKASLDCHMRIHTGEKPFTCEQCGKSFSEKGSLTIHMRFHTGERPFVC
YQCGKGFVIKGNLDRHMIVHSGEKPYSCPQCGKGFKHKARIGVHMMIHSGEKPFACDQCGKSFSTKVHLESHKRVHLKDN
RVKCHQCGMSFPDGSQLKDHVQTHIGQKPFMCPECGRSCSKKPSLKIHMRSHAAEKPFTCKQCGKSYCVRGVLNVHMRIH
TGEKPYTCKQCGKSFLYQSDLKRHSKTHSGQED

The command line and the AddressSanitizer report are shown below (adding -yX2 leads to the same crash):

$ ../spaln-1/src/spaln -Q7 -t1 -O0 -Thomosapi -dhs38.bkp -LS this-protein.fa
=================================================================
==196113==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x2aab82fd0abe at pc 0x00000058bedf bp 0x2aab8a71c420 sp 0x2aab8a71c418
READ of size 2 at 0x2aab82fd0abe thread T2
    #0 0x58bede in Aln2h1::forwardH_ng(int*, WINDOW const&, bool, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:349
    #1 0x5ae54a in Aln2h1::trcbkalignH_ng(WINDOW const&, bool, RANGE const*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:1962
    #2 0x5af5d6 in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2034
    #3 0x5b002e in Aln2h1::lspH_ng(WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2088
    #4 0x5c366f in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2974
    #5 0x5c678d in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3119
    #6 0x5c2759 in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2948
    #7 0x5c678d in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3119
    #8 0x5c2759 in Aln2h1::interpolateH(unsigned int, int, JUXT const*, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:2948
    #9 0x5c5f3d in Aln2h1::seededH_ng(unsigned int, int, BOUND&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3097
    #10 0x5c7093 in Aln2h1::globalH_ng(int*, WINDOW const&) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3140
    #11 0x5c7797 in alignH_ng(Seq const**, PwdB const*, int*) /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:3167
    #12 0x4ccf74 in spalign2 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:641
    #13 0x4cff6d in blkaln /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:823
    #14 0x4d3ac1 in quick4 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1025
    #15 0x4d4773 in spaln_job /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1066
    #16 0x4d6efe in worker_func /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1283
    #17 0x2aaaaacd6ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)
    #18 0x2aaaab7058dc in __clone (/lib64/libc.so.6+0xfe8dc)

0x2aab82fd0abe is located 3394 bytes to the left of 8464442-byte region [0x2aab82fd1800,0x2aab837e403a)
allocated by thread T2 here:
    #0 0x4a465f in malloc (/hlilab/hli/miniprot/spaln-1/src/spaln+0x4a465f)
    #1 0x6a9ec4 in operator new(unsigned long) (/hlilab/hli/miniprot/spaln-1/src/spaln+0x6a9ec4)
    #2 0x4d405b in genomicseq /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1049
    #3 0x4ccf3e in spalign2 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:638
    #4 0x4cff6d in blkaln /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:823
    #5 0x4d3ac1 in quick4 /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1025
    #6 0x4d4773 in spaln_job /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1066
    #7 0x4d6efe in worker_func /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1283
    #8 0x2aaaaacd6ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)

Thread T2 created by T0 here:
    #0 0x450322 in pthread_create (/hlilab/hli/miniprot/spaln-1/src/spaln+0x450322)
    #1 0x4d80a5 in MasterWorker /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1335
    #2 0x4da029 in main /homes6/hli/hli1/miniprot/spaln-1/src/spaln.cc:1486
    #3 0x2aaaab629554 in __libc_start_main (/lib64/libc.so.6+0x22554)

SUMMARY: AddressSanitizer: heap-buffer-overflow /homes6/hli/hli1/miniprot/spaln-1/src/fwd2h1.cc:349 in Aln2h1::forwardH_ng(int*, WINDOW const&, bool, RANGE const*)

Thank you,

Heng

ogotoh commented 2 years ago

Dear Heng,

The crash was due to a simple mistake: the start site of the query was not copied in the local alignment. I also fixed several errors in the Hirschberg method for DNA queries. The fixed version has been uploaded as spaln.2.4.13b.

Osamu,

lh3 commented 2 years ago

Dear Osamu,

On my end, it seems that v2.4.13b still segfaults on the same input sequence (ENSDARP00000141901.2 in my last post):

hli@node01 spaln$ ../spaln-new/src/spaln -Q7 -t1 -O0 -dhs38.bkp -LS 4-3.fa
Segmentation fault (core dumped)

Could you help to check if this happens on your machine?

Thanks,

Heng

ogotoh commented 2 years ago

Dear Heng,

Although spaln worked normally with the human genomic sequence that I use locally, it did fail with GCA_000001405.15_GRCh38_no_alt_analysis_set.fna. I found that the segmentation fault was due to incomplete initialization of alignment variables. After the correction, it produced the following output, in which the fourth alignment is problematic.

$ spaln -Q7 -d homosa38_g -T homosapi -O0 -LS -M ENSDARP00000141901.2

##gff-version 3

##sequence-region chr19 43579393 44628992
chr19 ALN gene 44100807 44157711 980 + . ID=gene00001;Name=chr19_44129
chr19 ALN mRNA 44100807 44157711 980 + . ID=mRNA00001;Parent=gene00001;Name=chr19_44129
chr19 ALN cds 44100807 44100946 107 + 0 ID=cds00001;Parent=mRNA00001;Name=chr19_44129;Target=ENSDARP00000141901.2 9 55 +
chr19 ALN cds 44106802 44107252 327 + 1 ID=cds00002;Parent=mRNA00001;Name=chr19_44129;Target=ENSDARP00000141901.2 56 201 +
chr19 ALN cds 44131284 44131358 115 + 0 ID=cds00003;Parent=mRNA00001;Name=chr19_44129;Target=ENSDARP00000141901.2 202 226 +
chr19 ALN cds 44157097 44157711 600 + 0 ID=cds00004;Parent=mRNA00001;Name=chr19_44129;Target=ENSDARP00000141901.2 227 431 +

##sequence-region chr19 43579393 44628992
chr19 ALN gene 43719342 44107267 915 + . ID=gene00002;Name=chr19_43913
chr19 ALN mRNA 43719342 44107267 915 + . ID=mRNA00002;Parent=gene00002;Name=chr19_43913
chr19 ALN cds 43719342 43719512 86 + 0 ID=cds00005;Parent=mRNA00002;Name=chr19_43913;Target=ENSDARP00000141901.2 1 45 +
chr19 ALN cds 43719717 43719794 56 + 0 ID=cds00006;Parent=mRNA00002;Name=chr19_43913;Target=ENSDARP00000141901.2 46 67 +
chr19 ALN cds 43996527 43996946 329 + 0 ID=cds00007;Parent=mRNA00002;Name=chr19_43913;Target=ENSDARP00000141901.2 68 201 +
chr19 ALN cds 44086145 44086399 280 + 0 ID=cds00008;Parent=mRNA00002;Name=chr19_43913;Target=ENSDARP00000141901.2 202 286 +
chr19 ALN cds 44106830 44107267 405 + 0 ID=cds00009;Parent=mRNA00002;Name=chr19_43913;Target=ENSDARP00000141901.2 287 431 +

##sequence-region chr19 56510465 58357760
chr19 ALN gene 56622134 58067960 953 + . ID=gene00003;Name=chr19_57345
chr19 ALN mRNA 56622134 58067960 953 + . ID=mRNA00003;Parent=gene00003;Name=chr19_57345
chr19 ALN cds 56622134 56622276 94 + 0 ID=cds00010;Parent=mRNA00003;Name=chr19_57345;Target=ENSDARP00000141901.2 1 49 +
chr19 ALN cds 56622358 56622603 168 + 1 ID=cds00011;Parent=mRNA00003;Name=chr19_57345;Target=ENSDARP00000141901.2 50 130 +
chr19 ALN cds 57420911 57421083 191 + 1 ID=cds00012;Parent=mRNA00003;Name=chr19_57345;Target=ENSDARP00000141901.2 131 185 +
chr19 ALN cds 58067197 58067960 709 + 2 ID=cds00013;Parent=mRNA00003;Name=chr19_57345;Target=ENSDARP00000141901.2 186 432 +

##sequence-region chr19 11713537 13099008
chr19 ALN gene 12014935 12319160 716 - . ID=gene00004;Name=chr19_12167
chr19 ALN mRNA 12014935 12319160 716 - . ID=mRNA00004;Parent=gene00004;Name=chr19_12167
chr19 ALN cds 12318669 12319160 364 - 0 ID=cds00014;Parent=mRNA00004;Name=chr19_12167;Target=ENSDARP00000141901.2 82 241 +
chr19 ALN cds 12273300 12273822 455 - 0 ID=cds00015;Parent=mRNA00004;Name=chr19_12167;Target=ENSDARP00000141901.2 242 415 +
chr19 ALN cds 12014935 12014990 37 - 2 ID=cds00016;Parent=mRNA00004;Name=chr19_12167;Target=ENSDARP00000141901.2 416 432 +

The corrected version has been uploaded as spaln.2.4.13c.
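
The thread does not say what makes the fourth alignment problematic. As a hedged aid for inspecting output like the above, the following Python sketch tabulates, per mRNA, the query range covered and the longest gap between consecutive cds features. The function name summarize_alignments is hypothetical, and the parsing assumes whitespace-separated columns with Parent= and Target= attributes laid out as in this output.

```python
import re
from collections import defaultdict


def summarize_alignments(gff_text):
    """Group cds records by Parent and report query coverage and the
    largest genomic gap (intron) between consecutive cds features."""
    cds = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # the 9th field keeps the Target spaces
        if len(cols) < 9 or cols[2].lower() != "cds":
            continue
        parent = re.search(r"Parent=([^;\s]+)", cols[8])
        target = re.search(r"Target=\S+\s+(\d+)\s+(\d+)", cols[8])
        if parent and target:
            cds[parent.group(1)].append(
                (int(cols[3]), int(cols[4]),
                 int(target.group(1)), int(target.group(2)))
            )
    summary = {}
    for parent, segs in cds.items():
        segs.sort()  # order by genomic start, regardless of strand
        introns = [b[0] - a[1] - 1 for a, b in zip(segs, segs[1:])]
        summary[parent] = {
            "query_start": min(s[2] for s in segs),
            "query_end": max(s[3] for s in segs),
            "max_intron": max(introns, default=0),
        }
    return summary
```

For the fourth alignment (mRNA00004), this reports that the query is covered from residue 82 to 432 and that the longest gap between cds features is 258309 bp. Whether either property is the problem Osamu refers to is not stated in the thread.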

Osamu,

lh3 commented 2 years ago

Dear Osamu,

Thank you so much for the fix. v2.4.13c now can align all zebrafish proteins without segfault on my end as well. I will close this issue.

Heng