mcfrith / last-genome-alignments

47 stars 5 forks source link

Lastal aligner is not detecting query sequences. #10

Open Rob-murphys opened 3 years ago

Rob-murphys commented 3 years ago

I am using the LAST aligner to align a reference to a denvo assembled genome to identify contamination. I have run lastdb on the reference to create the database but when using lastal I get no significant output:

lastal $db_name $denovo_assembly > $outfile

# LAST version 1060
    #
    # a=21 b=9 A=21 B=9 e=139 d=102 x=138 y=44 z=138 D=1e+06 E=6941.95
    # R=01 u=0 s=2 S=0 M=0 T=0 m=10 l=1 n=10 k=1 w=1000 t=4.36661 j=3 Q=0
    # termitodb
    # Reference sequences=47 normal letters=70034757
    # lambda=0.228526 K=0.433378
    #
    #     A   C   G   T   M   S   K   W   R   Y   B   D   H   V
    # A   6 -18 -18 -18   3 -18 -18   3   3 -18 -18   1   1   1
    # C -18   6 -18 -18   3   3 -18 -18 -18   3   1 -18   1   1
    # G -18 -18   6 -18 -18   3   3 -18   3 -18   1   1 -18   1
    # T -18 -18 -18   6 -18 -18   3   3 -18   3   1   1   1 -18
    # M   3   3 -18 -18   3   0 -18   0   0   0  -2  -2   1   1
    # S -18   3   3 -18   0   3   0 -18   0   0   1  -2  -2   1
    # K -18 -18   3   3 -18   0   3   0   0   0   1   1  -2  -2
    # W   3 -18 -18   3   0 -18   0   3   0   0  -2   1   1  -2
    # R   3 -18   3 -18   0   0   0   0   3 -18  -2   1  -2   1
    # Y -18   3 -18   3   0   0   0   0 -18   3   1  -2   1  -2
    # B -18   1   1   1  -2   1   1  -2  -2   1   1  -1  -1  -1
    # D   1 -18   1   1  -2  -2   1   1   1  -2  -1   1  -1  -1
    # H   1   1 -18   1   1  -2  -2   1  -2   1  -1  -1   1  -1
    # V   1   1   1 -18   1   1  -2  -2   1  -2  -1  -1  -1   1
    #
    # Coordinates are 0-based.  For - strand matches, coordinates
    # in the reverse complement of the 2nd sequence are used.
    #
    # name start alnSize strand seqSize alignment
    #
    # Query sequences=0

It seems as though no query sequences are being detected but the de novo assemblies are in the correct fasta format eg:

  >scaffold03054
GACATTCCTTCCCATGTTGCCCATGAAGTTACCGATCCCTCCACTAATTTTCTCTGGGACGAGATCAACACCACCAACCAAGCCTACTCCAAGCATGCCAATGCCCAATGCAAACCAACTCCTAACTGGCCCCCCAGCACCTTAGTCTGGCTTGATCAATGAATCTCAAGACCTAGAGGCCCTCTATCAAGCTTGAACACAAACAACTCAGCCCCTTCAAGATTCTCTAAAAAGTCTCAATGCATGCTTACAA
     >scaffold03288
     GCGGCGTGGAGAAAGACTTTGGAGAGATAAACTTGAAGAAAGTACAAAAATGATCTAGGAGAATTATGTAAGTGCCATGTGGATTTGGAGCGATTTTGGCTGGGGAACCATGCATTCGAGCGGACATAACCCTTCTCTCATTACTTATGTAAGAGAGTCGGGGATGTTGAGGTCGAAACAACCAACACACAACAGAACTG TAGCATCTCCACACTATTAGAAGTGGCCTCTTGAGAA
mcfrith commented 3 years ago

My best guess is: $denovo_assembly doesn't have the right file name, or something like that.

Indeed your sequence format looks fine, yet Query sequences=0: impossible!

Rob-murphys commented 3 years ago

An example command from the bash script would be

cd /home/lamma/local-blast/termitomycesBGI/Genome_v3/termitodb # to get me into the database directory as it didn't seem to work if I was not there.
lastal -P4 -E 0.5 -C 2 termitodb /home/lamma/local-blast/termitomycesBGI/redundans-output_hybrid/IC35-2/scaffolds.reduced.fa > /home/lamma/local-blast/termitomycesBGI/redundans-output_hybrid/IC35-2_termitodb.maf

The assembly filepath is indeed correct:

less /home/lamma/local-blast/termitomycesBGI/redundans-output_hybrid/IC35-2/scaffolds.reduced.fa

>scaffold09232
GGCGAGAATGTCGGTTTGCCCGACGGACAGATGGGTAACTCGGAAGTAGGTCACCTCAATATCGGTGCAGGACGCATCGTATACCAGGACTTGGTAAAGA
TTAACCGTGCTTGTGCCGACGGCAGCATCCTGAAGAATAAGGAAATCGTTTCCGCTTTCTCTTACGCGAAGGAGCACGGTAAGAACATCCATTTTATGGG
ACTCACCAGCAATGGCGGTGTGCACAGTTCTTTCGACCATCTGTTCAAGCTTTGCGATATTTCAAAGGAAT
>scaffold09237
GGTGTGCGCATGAATGATCTGGCCAATTCCACCATGACGACGACTTCTCCCCCCAAGCGCATTGACTTCAATACGCCTGAGCTGCAACGCAAGCGCCGCA
TTCGCGCGCTCAAGGATCGCCTGACCCGTTGGTACGTGCTAGTCGGCGGCCCCGCCGTCCTCGGCGCCATCACCCTGATTTTCTTCTTTCTTGCCTATGT
CGTCGCGCCGCTGTTTCAAGGTGCCTCATTGACCAAGGAAGACGCCCTGACCCCGGCCTGGATCCAGGACG
>scaffold09236
GCGTAGCCCGAGAAAACTTTGGCTGCTCGTCCGGTTGGAGCCAACAAAAACGTTTTATGCTTTAGTTCGTTAAGCGTTTTCACCAAAGCCCCGACCAGAG
AGGTTTTTCCGGTTCCTGCGTAGCCTTTCATCAGTAGAAGGCTGTCCCGTTCCTCAGCCAAGAGAAATGCGGTTAGCGTATGAAGCGCTAGAATGAGAAC
ATCCGTCGGGTTATAGAGGAAATTTCGTACGAATTGCTGGCCGAAGTAACTATTTATGATTTTGACCGCAA
mcfrith commented 3 years ago

Hi, I copy-pasted your 3 sequences and lastal options. I got the expected Query sequences=3. Could you send some minimal sequence files and exact options (including for lastdb) that exhibit the problem?

Rob-murphys commented 3 years ago

hi, where should I upload the sequence files to?

mcfrith commented 3 years ago

Apparently, you can attach files to these comments: https://docs.github.com/en/free-pro-team@latest/github/managing-your-work-on-github/file-attachments-on-issues-and-pull-requests

Or you could email them to my address shown in here: https://sites.google.com/site/mcfrith/

Rob-murphys commented 3 years ago

Hi there, sorry for the slow reply.

I have attached an example sequence file. P1_19_mod.zip

All commands used are below

last-train:

cd termitodb

lastdb -uNEAR -R01 termitodb ../Termitomyces_v3.0.fa

lastal

cd $db_path

lastal -P4 -E 0.5 -C 2 $db_name $denovo_assembly > $outfile

echo $db_name
termitodb

echo $denovo_assembly
/home/lamma/local-blast/termitomycesBGI/redundans-output_ragtag_hybrid/P1_19_mod.fa

echo $outfile
/home/lamma/local-blast/termitomycesBGI/redundans-output_ragtag_hybrid/P1_19_mod.maf
mcfrith commented 3 years ago

Hi:

I tried your commands and sequences, with LAST version 1060, and got the expected # Query sequences=1558. (I don't have Termitomyces_v3.0.fa, so I used some random file instead, but I'm 99.99% sure that can't explain it.)

So I'm not sure what to suggest!

Are you using some job queuing/submission system? What if you run it in the simplest possible way, on your own local computer, without any $ variables?

Rob-murphys commented 3 years ago

I am using a job submission system yes. This is all running in slurm. What would the job submission system be doing to cause this error?

I can try run it locally 👍