mcfrith / last-rna

MIT License
49 stars 6 forks source link

last-train error #5

Closed LipengKang closed 5 years ago

LipengKang commented 5 years ago

Hi, macfrith @mcfrith last is really nice split aligner. Now I want to align some contigs(N50=60kb) to genetically close references(4.8Gb). last-rna maybe a good choice. But when I follow recipies ,last-train always show last-train: error: [Errno 2] No such file or directory: '2'.
(repeat masked mydb and simple read identifiers was created successfully with last version 979)

Am I making some error? Can last-rna handle such task? or how about lastal ?

Thanks lipeng

mcfrith commented 5 years ago

Hi please could you show your exact commands for lastdb, last-train, etc? Have a nice day, Martin

LipengKang commented 5 years ago

Hi, lastdb -P8 -uNEAR -cR11 softmaskRef.fa awk '/>/ {$0 = ">" ++n} 1' myseq.fa > seq.fa last-train -P8 -Q0 mydb seq.fq > seq.par

Thanks, lipeng

mcfrith commented 5 years ago

I'm not sure if this is the full answer, but: The lastdb command should include the name that you give to the DB (like "mydb"). It looks like you're mixing up seq.fa and seq.fq. Have a nice day, Martin

LipengKang commented 5 years ago

Sorry for such few unexact information. Here I attach more.

  1. lastdb -P8 -uNEAR -cR11 mydb smRef.fa mydb was used in previous pairwise whole genome alignment by lastal. This works well before. smRef is softmasked genome (~4.9Gb)download form Ensembl.

  2. awk '/>/ {$0 = ">" ++n} 1' raw.ctg.fa > last.ctg.fa

    1 AAGTCAATTTTCAAACTCTCCCCCTCCCCTGGACGCTTCAGTTTAGAAAAAACAAAGAAATGATGGAAT 2 .......... last.ctg.fa is a fasta full of 99319 contigs(N50~70Kb) assembled by wtdbg2. And it changed simplified contig header as last-rna pointed.

  3. last-train -P8 mydb last.ctg.fa > last.ctg.par last-train: error: [Errno 2] No such file or directory: '2'. This error I encountered.

By the way all these commands are used with "nohup". query and target divergence is ~1Mya. In next test, my queries and targets sequence divergence may range from about 1-20Mya. So I am not sure if last-train can work well up to this time range. Or maybe i should use classic parameters like NEAR or MAM4 or MAM8. Any advice?

Thanks, lipeng

mcfrith commented 5 years ago

Mysterious... your commands look fine to me. It seems to be trying to read a non-existent file called "2". My guess: maybe you're running last-train via a shell script, which has "2" instead of "$2"?

last-train should work well for all those time ranges.

The choice of NEAR/MAM4/MAM8 is independent of last-train, actually. MAM4/8 have better sensitivity (for not-too-similar sequences), but use more memory and time. Without knowing much about your data, I guess MAM4/8 are overkill: I'd first try not using them.

Have a nice day, Martin

LipengKang commented 5 years ago

last-train works well now. I made a mistake when using "nohup".

Thank you, Martin

LipengKang commented 5 years ago

Hi, Martin Any way to parallel last-split? lastal brings out 5T alignment result. It cost much time for last-split. "parallel-fastq" seems not work.

thanks lipeng

mcfrith commented 5 years ago

I usually find that last-split is so much faster than lastal, that it's not necessary to parallelize it (except for spliced alignment).

I guess you want parallel-fasta, not fastq.

It might help to show your exact commands, with and without parallel-fast*

Have a nice day, Martin

LipengKang commented 5 years ago

My command like this. parallel-fasta "lastal mydb" < queries.fa > alns.maf last-split -m1 -fMAF alns.maf >split1.maf 'maf-swap split1.maf >1split.maf' 'last-split -m1 1split.maf> 2split.maf'

last-split is faster than lastal, during lastal step, parallel did help speed up alignment. And I get a big(5T) alignment file. it cost nearly 12 h. BUT the last three steps runned in one thread, it cost many days (almost a week)

SO I try parallel-fasta "last-split -m1 -fMAF" < alns.maf > split1.maf, But it only run in one thread.

Any advice?

lipeng

mcfrith commented 5 years ago

Oh, I see. You can't use parallel-fasta/q with a ".maf" input file. But you can do this:

parallel-fasta "lastal mydb | last-split -m1 -fMAF" < queries.fa > split1.maf

You can also give your last-train result to lastal, like here: https://github.com/mcfrith/last-genome-alignments

Have a nice day, Martin