Question about crude allele prediction

HLHsieh commented 6 months ago

Hi,

I greatly appreciate this amazing tool. I am currently attempting allele prediction using the -o2 option. I am curious about how this tool predicts alleles and how to interpret the output.

Here are my results:

chrX    147912050       147912110       CGG     FMR1    5'UTR   91      95      .       91:SRR15130017.5400,95:SRR15130017.258,97:SRR15130017.37790

In this example, there are two alleles with repeat of 91 and 95. I am wondering which allele the read with a repeat length of 97 will be assigned to.

chrX    147912050       147912110       CGG     FMR1    5'UTR   97      100     92:SRR15130032.176659,97:SRR15130032.141681,98:SRR15130032.4378,100:SRR15130032.153319,100:SRR15130032.196143        95:SRR15130032.179470,96:SRR15130032.65854,97:SRR15130032.175600,99:SRR15130032.152398,99:SRR15130032.202720,100:SRR15130032.185264,102:SRR15130032.120751

Similarly, in this example, I would like to know if this tool assigns the read to a specific allele.

Thank you.

mcfrith commented 6 months ago

Thanks for your interest in tandem-genotypes. Of course I've forgotten how it predicts alleles :confused:, but a code comment says it uses k-medoids clustering with k=2.

If there are just 3 reads with different repeat counts, I think it's impossible to predict 2 alleles reliably. It just picks 2 of them in that case.

tandem-genotypes-merge assigns reads to alleles. It simply assigns each read to the allele with the closest repeat count.

HLHsieh commented 6 months ago

Thank you for your quick response. I was trying to run tandem-genotypes-merge as follows:

last-train -P8 -Q0 --postmask=0 ${mydb_dir}/mydb ${sample}.fastq > ${mydb_dir}/${sample}.par

lastal -P8 --split -p ${mydb_dir}/${sample}.par ${mydb_dir}/mydb ${sample}.fastq > ${mydb_dir}/${sample}.maf

tandem-genotypes -v -o2 --postmask=0 -g ${ref_dir}/refGene.txt $predefined ${mydb_dir}/${sample}.maf > tandem-genotypes/${sample}_allele.txt

tandem-genotypes-merge ${input_dir}/${sample}.fastq ${mydb_dir}/${sample}.maf tandem-genotypes/${sample}_allele.txt > tandem-genotypes/${sample}.merged.fa

The error message shows:

lamassemble: error: can't read the last-train data
lamassemble: error: can't read the last-train data

I have no idea why the last-train data can not be read. Any advice would be appreciated.

Best, Hsin

mcfrith commented 6 months ago

You need to give tandem-genotypes-merge your .par instead of .maf.

HLHsieh commented 6 months ago

Oh, sorry that I did not notice this.

After correcting to

tandem-genotypes-merge ${input_dir}/${sample}.fastq ${mydb_dir}/${sample}.par tandem-genotypes/${sample}_allele.txt > tandem-genotypes/${sample}.merged.fa

There is still error message:

lastal: can't interpret: 30g
Traceback (most recent call last):
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 756, in <module>
    main(opts, *args)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 701, in main
    assemble(opts, scale, probMatrix, gapProbs, sequences, tmpdir)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 595, in assemble
    alignments = pairwiseAlignments(opts, *tmpFileNames)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 571, in pairwiseAlignments
    addStrandAlignments(alignments, opts, db, seqFile, fwdMat, 1)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 567, in addStrandAlignments
    raise subprocess.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('lastal', '-s1', '-j4', '-D1e9', '-P1', '-m5', '-z30g', '-p', '/tmp/lamassembleC8MgEn/fwd.mat', '/tmp/lamassembleC8MgEn/db', '/tmp/lamassembleC8MgEn/x.fa')' returned non-zero exit status 1
lastal: can't interpret: 30g
Traceback (most recent call last):
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 756, in <module>
    main(opts, *args)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 701, in main
    assemble(opts, scale, probMatrix, gapProbs, sequences, tmpdir)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 595, in assemble
    alignments = pairwiseAlignments(opts, *tmpFileNames)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 571, in pairwiseAlignments
    addStrandAlignments(alignments, opts, db, seqFile, fwdMat, 1)
  File "/home/hsinlun/miniconda3/envs/poretools/bin/lamassemble", line 567, in addStrandAlignments
    raise subprocess.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '('lastal', '-s1', '-j4', '-D1e9', '-P1', '-m5', '-z30g', '-p', '/tmp/lamassemble8Ocppx/fwd.mat', '/tmp/lamassemble8Ocppx/db', '/tmp/lamassemble8Ocppx/x.fa')' returned non-zero exit status 1

mcfrith commented 6 months ago

I think it's because you have an old version of LAST. Try lastal --version. It should be >= 935 according to https://gitlab.com/mcfrith/lamassemble

mcfrith / tandem-genotypes

Question about crude allele prediction #22