mcfrith / last-rna

About the runtime #10

Open LiShuhang-gif opened 3 years ago

LiShuhang-gif commented 3 years ago

Hello, I'm using last-train to determine the rates of insertion, deletion, and substitutions between my reads and the genome:

last-train -P12 -Q0 mydb 17HanZZ0034.fastq > myseq.par

Unfortunately, the program has now been running for 70 hours and is still going. Although I prepared the genome without repeat-masking, the current run time still seems too long. Is this running time normal? And if it's abnormal, what can I do to speed it up? By the way, my fastq file (17HanZZ0034.fastq) is 72G. Thanks a lot!

mcfrith commented 3 years ago

I'm surprised it's that slow. If you can attach your (incomplete) myseq.par, it might show some clues. The 72G shouldn't matter, because last-train uses a fixed-size sample of the query sequences. What genome are you using?

LiShuhang-gif commented 3 years ago

I'm using the human genome. The above step has now completed, after nearly 78 hours. Now I'm using lastal to align the DNA reads to their orthologous bases in the genome. Here is my command:

lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf

So far, this step has taken 40 hours and is still running, which still seems like a lot of time to me. I'll put the complete myseq.par file here. Please let me know if you have any suggestions to solve this problem and speed it up. Thanks.

# lastal version: 1256
# maximum percent identity: 100
# scale of score parameters: 4.5512
# scale used while training: 91.024

# lastal -j7 -S1 -P12 -Q0 -r5 -q5 -a15 -b3

# aligned letter pairs: 910388.1
# deletes: 59366.28
# inserts: 34921.5338
# delOpens: 34326.18
# insOpens: 22506.0228
# alignments: 558
# mean delete size: 1.72948
# mean insert size: 1.55165
# matchProb: 0.940698
# delOpenProb: 0.035469
# insOpenProb: 0.0232553
# delExtendProb: 0.42179
# insExtendProb: 0.355526

# substitution percent identity: 92.3214

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 242472         4147.0099      13431.5314     5116.1765     
# C 4379.9383      173498.66      2847.70253     5875.3163     
# G 14191.693      2485.130373    167794.16      3758.22718    
# T 4795.61218     5336.880759    3540.324008    256724.7      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.266337       0.00455518     0.0147535      0.00561973    
# C 0.00481103     0.190575       0.00312799     0.00645359    
# G 0.0155885      0.00272973     0.184309       0.00412813    
# T 0.00526762     0.00586216     0.00388878     0.281993      

# delExistCost: 280
# insExistCost: 292
# delExtendCost: 74
# insExtendCost: 90

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     98   -239   -133   -255
# C   -235    133   -242   -210
# G   -129   -255    128   -252
# T   -260   -218   -256    100

# lastal -j7 -S1 -P12 -Q0 -t91.0723 -p-

# aligned letter pairs: 893238.2
# deletes: 63016.81
# inserts: 40230.9481
# delOpens: 38189.95
# insOpens: 25792.2964
# alignments: 578
# mean delete size: 1.65009
# mean insert size: 1.5598
# matchProb: 0.932594
# delOpenProb: 0.0398726
# insOpenProb: 0.0269287
# delExtendProb: 0.393972
# insExtendProb: 0.358894

# substitution percent identity: 94.7581

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 242609.71      1927.340774    12907.436195   2254.84882    
# C 2184.642908    175679.33      1155.5313591   3631.06656    
# G 13769.9899     942.8457702    169549.65      1537.011703   
# T 2054.001752    3118.2846542   1344.2370255   258653.7      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.271582       0.0021575      0.0144488      0.00252412    
# C 0.00244553     0.196659       0.00129353     0.00406469    
# G 0.0154144      0.00105544     0.189797       0.00172056    
# T 0.00229929     0.00349067     0.00150477     0.289542      

# delExistCost: 260
# insExistCost: 280
# delExtendCost: 79
# insExtendCost: 89

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A    100   -308   -136   -328
# C   -297    135   -324   -253
# G   -131   -342    129   -332
# T   -337   -266   -344    102

# lastal -j7 -S1 -P12 -Q0 -t91.3324 -p-

# aligned letter pairs: 886784.7
# deletes: 67261.25
# inserts: 44839.3056
# delOpens: 41196.03
# insOpens: 28310.4247
# alignments: 582
# mean delete size: 1.63271
# mean insert size: 1.58384
# matchProb: 0.926752
# delOpenProb: 0.0430527
# insOpenProb: 0.0295864
# delExtendProb: 0.387522
# insExtendProb: 0.368625

# substitution percent identity: 95.4855

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 242742.9       1294.178206    12792.055635   1426.286513   
# C 1548.539595    175712.03      675.8587138    2870.859088   
# G 13583.8445     542.2332017    169457.2       925.261326    
# T 1264.153482    2365.3611071   746.05339484   258854.4      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.273729       0.00145938     0.0144249      0.00160835    
# C 0.00174621     0.198141       0.000762131    0.00323732    
# G 0.0153178      0.000611448    0.191088       0.00104337    
# T 0.00142552     0.0026673      0.000841286    0.291897      

# delExistCost: 251
# insExistCost: 276
# delExtendCost: 80
# insExtendCost: 86

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -344   -137   -370
# C   -328    136   -372   -274
# G   -132   -392    129   -379
# T   -381   -291   -397    102

# lastal -j7 -S1 -P12 -Q0 -t91.0781 -p-

# aligned letter pairs: 885006.1
# deletes: 69713.81
# inserts: 47438.5439
# delOpens: 42799.94
# insOpens: 29606.6777
# alignments: 590
# mean delete size: 1.62883
# mean insert size: 1.60229
# matchProb: 0.923802
# delOpenProb: 0.0446762
# insOpenProb: 0.0309046
# delExtendProb: 0.386062
# insExtendProb: 0.375894

# substitution percent identity: 95.8208

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243073.3       1023.549276    12681.13977    1056.7107291  
# C 1284.40529     175990.55      482.3514595    2517.854682   
# G 13448.37729    384.94506597   169807.11      667.2841279   
# T 928.1245674    2001.6120012   513.58975493   259238.9      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.274628       0.00115642     0.0143274      0.00119389    
# C 0.00145114     0.198837       0.000544968    0.00284471    
# G 0.0151942      0.000434917    0.191851       0.000753908   
# T 0.00104861     0.00226145     0.000580262    0.292892      

# delExistCost: 247
# insExistCost: 275
# delExtendCost: 80
# insExtendCost: 84

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -365   -138   -397
# C   -345    136   -403   -286
# G   -133   -424    129   -409
# T   -409   -306   -432    102

# lastal -j7 -S1 -P12 -Q0 -t91.0806 -p-

# aligned letter pairs: 884838.6
# deletes: 71372.83
# inserts: 49059.9165
# delOpens: 43780.62
# insOpens: 30388.6841
# alignments: 592
# mean delete size: 1.63024
# mean insert size: 1.61441
# matchProb: 0.92209
# delOpenProb: 0.0456238
# insOpenProb: 0.031668
# delExtendProb: 0.386593
# insExtendProb: 0.38058

# substitution percent identity: 95.9969

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243524.9       891.127115     12594.69529    869.2600543   
# C 1161.369069    176279.56      391.22735866   2338.867059   
# G 13354.0229     310.63178636   170057.83      541.6494231   
# T 758.1233217    1812.1867268   400.83204362   259622.4      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275198       0.00100703     0.0142328      0.000982316   
# C 0.00131242     0.199206       0.00044211     0.00264306    
# G 0.0150908      0.000351033    0.192176       0.000612096   
# T 0.000856725    0.00204788     0.000452964    0.293389      

# delExistCost: 245
# insExistCost: 275
# delExtendCost: 80
# insExtendCost: 83

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -378   -139   -415
# C   -355    136   -422   -292
# G   -134   -443    129   -428
# T   -428   -315   -454    102

# lastal -j7 -S1 -P12 -Q0 -t91.0505 -p-

# aligned letter pairs: 884323.3
# deletes: 72203.83
# inserts: 49920.0574
# delOpens: 44280
# insOpens: 30773.555
# alignments: 594
# mean delete size: 1.63062
# mean insert size: 1.62217
# matchProb: 0.921197
# delOpenProb: 0.0461264
# insOpenProb: 0.0320567
# delExtendProb: 0.386736
# insExtendProb: 0.383543

# substitution percent identity: 96.0951

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243644.2       819.307969     12529.710094   763.011848    
# C 1092.851629    176362.47      342.63519497   2249.897176   
# G 13282.2126     272.90354627   170149.03      473.9446493   
# T 660.3587931    1707.1879268   341.41887092   259729.5      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275485       0.000926378    0.0141671      0.000862725   
# C 0.00123567     0.19941        0.000387412    0.00254392    
# G 0.015018       0.000308568    0.192385       0.000535881   
# T 0.000746657    0.00193029     0.000386037    0.293672      

# delExistCost: 245
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 83

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -386   -139   -427
# C   -360    136   -434   -296
# G   -135   -455    129   -440
# T   -440   -320   -469    102

# lastal -j7 -S1 -P12 -Q0 -t90.9734 -p-

# aligned letter pairs: 883486.1
# deletes: 72534.03
# inserts: 50281.1878
# delOpens: 44420.63
# insOpens: 30980.8254
# alignments: 594
# mean delete size: 1.63289
# mean insert size: 1.62298
# matchProb: 0.920794
# delOpenProb: 0.0462964
# insOpenProb: 0.0322891
# delExtendProb: 0.387589
# insExtendProb: 0.383849

# substitution percent identity: 96.1462

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243508.3       779.049721     12505.212987   701.6762278   
# C 1061.763832    176285.46      315.34366977   2197.380885   
# G 13234.85637    252.33082239   170099.03      436.8713964   
# T 605.6020096    1653.2986672   306.19728647   259598.1      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275605       0.000881736    0.0141535      0.000794164   
# C 0.00120171     0.199522       0.000356909    0.00248702    
# G 0.0149793      0.000285591    0.19252        0.000494455   
# T 0.000685426    0.00187122     0.000346557    0.293816      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 83

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -390   -140   -435
# C   -363    136   -442   -298
# G   -135   -462    129   -447
# T   -448   -323   -479    102

# lastal -j7 -S1 -P12 -Q0 -t90.9616 -p-

# aligned letter pairs: 883227.9
# deletes: 72817.92
# inserts: 50544.8579
# delOpens: 44605.62
# insOpens: 31109.7555
# alignments: 594
# mean delete size: 1.63248
# mean insert size: 1.62473
# matchProb: 0.920472
# delOpenProb: 0.0464865
# insOpenProb: 0.0324216
# delExtendProb: 0.387436
# insExtendProb: 0.384512

# substitution percent identity: 96.1781

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243513         759.394831     12469.516286   662.7568431   
# C 1043.019864    176281.82      298.81464313   2171.479071   
# G 13218.19326    240.30829139   170109.03      415.5769021   
# T 571.3877196    1622.0672939   284.77885023   259592.6      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.2757         0.00085977     0.0141177      0.000750358   
# C 0.00118088     0.199582       0.000338311    0.0024585     
# G 0.0149653      0.000272072    0.192594       0.000470507   
# T 0.000646912    0.00183647     0.00032242     0.293905      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -392   -140   -440
# C   -364    136   -447   -299
# G   -135   -467    129   -452
# T   -453   -325   -485    102

# lastal -j7 -S1 -P12 -Q0 -t91.0048 -p-

# aligned letter pairs: 882970.8
# deletes: 73078.79
# inserts: 50823.7388
# delOpens: 44724.84
# insOpens: 31230.4763
# alignments: 594
# mean delete size: 1.63396
# mean insert size: 1.62738
# matchProb: 0.92022
# delOpenProb: 0.0466116
# insOpenProb: 0.032548
# delExtendProb: 0.387992
# insExtendProb: 0.385514

# substitution percent identity: 96.2011

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243502.4       746.324946     12450.459489   636.1814449   
# C 1032.6109      176271.1       287.44690141   2149.651302   
# G 13196.64427    231.2572848    170098.63      399.4260808   
# T 547.8090392    1595.0326348   271.12648913   259582.9      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275767       0.000845216    0.0141002      0.000720478   
# C 0.00116944     0.199628       0.000325535    0.00243449    
# G 0.0149453      0.0002619      0.192637       0.000452352   
# T 0.000620396    0.00180638     0.000307052    0.293979      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -394   -140   -444
# C   -365    136   -450   -300
# G   -135   -470    129   -455
# T   -457   -327   -490    102

# lastal -j7 -S1 -P12 -Q0 -t90.9797 -p-

# aligned letter pairs: 882890.8
# deletes: 73177.41
# inserts: 50918.2687
# delOpens: 44769.16
# insOpens: 31273.6662
# alignments: 594
# mean delete size: 1.63455
# mean insert size: 1.62815
# matchProb: 0.92013
# delOpenProb: 0.0466575
# insOpenProb: 0.0325927
# delExtendProb: 0.388211
# insExtendProb: 0.385807

# substitution percent identity: 96.2125

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243502         737.721451     12449.957388   618.8685265   
# C 1027.271887    176266.77      281.71500221   2138.53712    
# G 13195.38323    226.85896076   170098.03      391.3771389   
# T 532.6066046    1577.8669476   261.49345601   259581.3      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275802       0.000835578    0.0141014      0.000700959   
# C 0.00116354     0.199648       0.000319084    0.00242221    
# G 0.0149457      0.000256951    0.192661       0.000443292   
# T 0.000603255    0.00178717     0.00029618     0.294014      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -395   -140   -446
# C   -366    136   -452   -301
# G   -135   -472    129   -457
# T   -460   -327   -493    102

# lastal -j7 -S1 -P12 -Q0 -t90.9658 -p-

# aligned letter pairs: 882773.8
# deletes: 73202.45
# inserts: 50967.3986
# delOpens: 44777.48
# insOpens: 31299.1961
# alignments: 595
# mean delete size: 1.63481
# mean insert size: 1.62839
# matchProb: 0.920087
# delOpenProb: 0.0466701
# insOpenProb: 0.0326221
# delExtendProb: 0.388306
# insExtendProb: 0.385898

# substitution percent identity: 96.2191

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243492         732.974624     12447.916387   610.3701628   
# C 1021.845502    176262.95      277.94192676   2127.732562   
# G 13195.02321    223.73471847   170091.83      386.1160161   
# T 521.1742646    1577.2818045   255.73610094   259569.9      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.27582        0.000830289    0.0141006      0.000691407   
# C 0.00115751     0.199665       0.000314843    0.00241022    
# G 0.0149469      0.000253439    0.192674       0.000437379   
# T 0.000590369    0.00178669     0.000289689    0.294032      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -447
# C   -366    136   -453   -301
# G   -135   -473    129   -458
# T   -462   -328   -495    102

# lastal -j7 -S1 -P12 -Q0 -t90.9579 -p-

# aligned letter pairs: 882733.8
# deletes: 73263.4
# inserts: 51001.0686
# delOpens: 44803.02
# insOpens: 31316.3961
# alignments: 594
# mean delete size: 1.63523
# mean insert size: 1.62857
# matchProb: 0.920043
# delOpenProb: 0.0466967
# insOpenProb: 0.03264
# delExtendProb: 0.388467
# insExtendProb: 0.385966

# substitution percent identity: 96.2224

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243498.7       728.900857     12450.525286   606.1124315   
# C 1022.093829    176265.34      276.13028846   2127.508449   
# G 13195.8532     222.44561863   170094.63      383.4949563   
# T 514.0464249    1568.8933984   252.30576189   259579.6      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.27583        0.000825682    0.0141037      0.00068659    
# C 0.0011578      0.199669       0.000312794    0.00240999    
# G 0.014948       0.000251981    0.192679       0.000434414   
# T 0.0005823      0.00177721     0.000285806    0.294046      

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -448
# C   -366    136   -454   -301
# G   -135   -474    129   -459
# T   -463   -328   -497    102

# lastal -j7 -S1 -P12 -Q0 -t90.9542 -p-

# aligned letter pairs: 882731.8
# deletes: 73283.13
# inserts: 51020.1286
# delOpens: 44809.75
# insOpens: 31325.6361
# alignments: 594
# mean delete size: 1.63543
# mean insert size: 1.6287
# matchProb: 0.920028
# delOpenProb: 0.046703
# insOpenProb: 0.0326492
# delExtendProb: 0.388539
# insExtendProb: 0.386014

# substitution percent identity: 96.2241

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243499.5       728.929657     12451.305286   601.9976592   
# C 1022.202506    176264.94      274.27387431   2127.528348   
# G 13196.42319    220.96275348   170093.63      380.9513207   
# T 510.4233786    1568.7967974   248.73720285   259578.5      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275836       0.000825731    0.0141048      0.000681942   
# C 0.00115795     0.199673       0.000310697    0.00241006    
# G 0.0149489      0.000250306    0.192682       0.000431541   
# T 0.000578207    0.00177713     0.000281769    0.29405       

# delExistCost: 244
# insExistCost: 274
# delExtendCost: 80
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -449
# C   -366    136   -455   -301
# G   -135   -474    129   -460
# T   -464   -328   -498    102

# lastal -j7 -S1 -P12 -Q0 -t90.9512 -p-

# aligned letter pairs: 883620.8
# deletes: 73538.02
# inserts: 51113.6886
# delOpens: 44920.56
# insOpens: 31380.6561
# alignments: 595
# mean delete size: 1.63707
# mean insert size: 1.62883
# matchProb: 0.919942
# delOpenProb: 0.046767
# insOpenProb: 0.0326706
# delExtendProb: 0.389152
# insExtendProb: 0.386062

# substitution percent identity: 96.2214

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243559.2       731.154747     12465.885286   597.9639979   
# C 1024.870416    176618.93      273.09495025   2136.903337   
# G 13224.68318    222.89115718   170428.43      379.1039278   
# T 508.1372043    1576.8193974   248.13743083   259657.8      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275627       0.000827422    0.0141072      0.000676695   
# C 0.00115981     0.199873       0.000309052    0.00241826    
# G 0.0149659      0.000252238    0.192868       0.000429019   
# T 0.000575041    0.00178443     0.000280808    0.293846      

# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -449
# C   -366    136   -455   -301
# G   -135   -474    128   -460
# T   -464   -328   -498    102

# lastal -j7 -S1 -P12 -Q0 -t90.7452 -p-

# aligned letter pairs: 883582.8
# deletes: 73586.69
# inserts: 51156.688
# delOpens: 44827.53
# insOpens: 31367.7955
# alignments: 595
# mean delete size: 1.64155
# mean insert size: 1.63087
# matchProb: 0.92004
# delOpenProb: 0.0466772
# insOpenProb: 0.0326621
# delExtendProb: 0.39082
# insExtendProb: 0.386829

# substitution percent identity: 96.2246

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243560.1       728.384735     12465.855382   594.8409534   
# C 1021.714178    176616.77      272.31061313   2128.647094   
# G 13222.94618    222.34998721   170407.13      377.8824812   
# T 505.4494746    1571.3704051   247.29998078   259657.4      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.275645       0.000824337    0.014108       0.000673201   
# C 0.00115631     0.199883       0.000308183    0.00240906    
# G 0.0149648      0.000251641    0.192855       0.000427662   
# T 0.000572034    0.00177837     0.000279878    0.293863      

# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -450
# C   -366    136   -456   -301
# G   -135   -474    128   -461
# T   -465   -328   -498    102

# lastal -j7 -S1 -P12 -Q0 -t90.7427 -p-

# aligned letter pairs: 883560.7
# deletes: 73598.7
# inserts: 51169.328
# delOpens: 44832.83
# insOpens: 31373.6055
# alignments: 595
# mean delete size: 1.64163
# mean insert size: 1.63097
# matchProb: 0.920028
# delOpenProb: 0.0466832
# insOpenProb: 0.0326685
# delExtendProb: 0.390848
# insExtendProb: 0.386867

# substitution percent identity: 96.2259

# count matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 243560.1       728.414824     12465.844382   590.8640321   
# C 1021.732958    176615.66      270.45205909   2128.725194   
# G 13223.04618    222.31203211   170405.93      375.4225772   
# T 501.8593603    1571.2857741   247.34986978   259656.3      

# probability matrix (query letters = columns, reference letters = rows):
#   A              C              G              T             
# A 0.27565        0.000824385    0.0141083      0.000668712   
# C 0.00115635     0.199885       0.000306085    0.00240919    
# G 0.0149652      0.000251602    0.192857       0.000424885   
# T 0.000567981    0.00177831     0.000279939    0.293867      

# delExistCost: 245
# insExistCost: 274
# delExtendCost: 79
# insExtendCost: 82

# score matrix (query letters = columns, reference letters = rows):
#        A      C      G      T
# A     99   -396   -140   -450
# C   -366    136   -456   -301
# G   -135   -474    128   -461
# T   -465   -328   -498    102

#last -Q 0
#last -t4.46385
#last -a 12
#last -A 14
#last -b 4
#last -B 4
#last -S 1
# score matrix (query letters = columns, reference letters = rows):
       A      C      G      T
A      5    -20     -7    -23
C    -18      7    -23    -15
G     -7    -24      6    -23
T    -23    -16    -25      5

mcfrith commented 3 years ago

It looks like last-train has worked well (but slowly). I suggest repeat-masking: that's what we would typically do with 72G of human long reads. No need to re-run last-train. (Or you could try the -k or -uRY options mentioned here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst)

Since lastal processes the reads in order, you can figure out how far it has got by looking at the incomplete output. I can only guess: there might be a slowdown due to insufficient memory, or multi-threading being ineffective for some reason (maybe top or something would show that).
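
For reference, a minimal sketch of the database-building alternatives mentioned above (hg38.fa stands in for whatever genome FASTA was used, and the flag meanings are as I understand them from the LAST docs, so check the cookbook before relying on this):

# mask simple repeats while building the database:
# -R01 additionally lowercases simple repeats found by tantan, -c soft-masks lowercase letters
lastdb -P12 -R01 -c mydb hg38.fa

# or use the sparse RY seeding mentioned in the cookbook (and later in this thread),
# which reduces memory use and run time
lastdb -P12 -uRY32 mydb hg38.fa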

LiShuhang-gif commented 3 years ago

Hi! Thank you for your reply. The following command has been running for 70 hours:

lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf

So far the size of the file myseq.maf is 67M. Can you roughly estimate how big myseq.maf should be in the end? I wonder if it's about to finish.

LiShuhang-gif commented 3 years ago

My other question is:

lastal -P 28 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf

How much memory does this step normally require to run at normal speed?

mcfrith commented 3 years ago

I'd expect the final output to be roughly similar in size to the input fastq file. Is it really only 67M?

Memory: with repeat-masking, for the human genome, I think < 20G. Without masking, it could be quite a lot more, sorry I'm not exactly sure.

Jesson-mark commented 3 years ago

Hi, I'm also using LAST to align my HiFi reads to a genome and have run into the same problem as LiShuhang: my program is running very slowly. When I increase the threads to 48, lastal runs faster, but not by much. It has been running for about 6 hours and the MAF file is 235M. Due to the runtime limit on our server (a program can run for at most 120 hours), I'm afraid it won't finish in time. My fq.gz file is about 90G. Could the alignment be broken down into multiple sub-tasks so that each sub-task finishes within the limit? Thanks for your reply.

mcfrith commented 3 years ago

Sub-tasks: you could split your fq file into parts, and align each part separately. Here's one way (not tested):

gzip -cd fq.gz | split -l200000 - myPrefix-

That puts each 200000-line chunk into a separate file whose name starts with myPrefix-.

But I recommend trying suggestions in the doc to go faster! (Especially for HiFi.) An easy one to try is lastal option -k64 (say).
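
A hypothetical follow-up to the split idea (not from this thread, just a sketch): align each chunk with the same command used above, keeping one MAF file per chunk.

for part in myPrefix-*; do
  lastal -P28 -k64 -p myseq.par mydb "$part" | last-split > "$part".maf
done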

Jesson-mark commented 3 years ago

Thanks for your suggestion. But I don't understand what the -k parameter means; could you explain it in more detail?

Besides, I found that the -i parameter can specify how much memory lastal uses. If I allow it more memory, will lastal run faster?

mcfrith commented 3 years ago

The -i parameter is explained here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst You're right: it increases the memory usage and probably increases speed (by making multi-threading more effective).

-k is explained here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-tuning.rst So -k64 checks every 64-th position in each DNA read, and if it finds a potential match to the genome, it tries extending an alignment from that position. I would expect quite high values of k to work fine for HiFi (even for non-HiFi), except that tiny rearranged fragments might be missed.
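
Concretely, that just means adding -k64 to the command already being used, e.g. (a sketch):

lastal -P28 -k64 -p myseq.par mydb 17HanZZ0034.fastq | last-split > myseq.maf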

Jesson-mark commented 3 years ago

Thanks a lot! I will read the docs and try the parameters you suggested.

Jesson-mark commented 3 years ago

Hi, I have tried the -k and -i parameters on my data. Unfortunately, the speed improvement was small. At the moment the result MAF file grows by about 500M per hour. I checked the size of my fastq file and it is about 220G, which means I would need nearly 20 days to run lastal on this file. Besides, I have three other fastq files of about the same size as the first one. I know LAST is great software, but it is not realistic for me to run lastal on all of these files.

I need to use tandem-genotypes to find changes in the lengths of tandem repeats in my data. Are there other solutions or programs that can generate the MAF file that tandem-genotypes takes as input?

If so, please let me know. Thank you most sincerely.

mcfrith commented 3 years ago

Sorry it's not working great first time, or second time. I tested aligning some human HiFi reads (SRR9087598) to the human genome (hg38), with neither repeat-masking nor multi-threading. This is what I get:

lastal -k64 -p myMat myDb qry.fq | last-split                (2.3G per hour)
lastal -k64 -R00 -p myMat myDb qry.fq | last-split           (3G per hour)
lastal -k64 -m2 -p myMat myDb qry.fq | last-split            (5.6G per hour)
lastal -k64 -m2 -R00 -p myMat myDb qry.fq | last-split       (12G per hour)

I hope that's fast enough. The -R00 has no effect on the alignments: it skips detection and lowercasing of simple repeats in the reads. (But they are still detected and lowercased in the genome.)

With the -P multithreading option, you can make it a few-fold faster. Then last-split becomes the bottleneck, which can be overcome by using parallel-fastq (see https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md).

The peak memory use was 15G, so your computer should have comfortably more than that. (Else you could use -uRY32 instead of -k, or repeat-masking).
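
A sketch of the parallel-fastq route, as I read the linked doc (the inner pipeline and file names are placeholders): parallel-fastq feeds fastq chunks to GNU parallel, which runs the quoted command on each chunk, with lastal reading the queries from its standard input.

parallel-fastq "lastal -k64 -m2 -R00 -p myMat myDb | last-split" < qry.fq > out.maf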

Jesson-mark commented 3 years ago

Sorry for not replying for so long! I split my original file into smaller files, each fastq file being 5.2G. Then I ran LAST on one of these files using the parameters you suggested above. Since the runtime limit for a single program is 120 hours, the alignment didn't finish before the limit. I don't know why; maybe our hardware doesn't let LAST perform at its best. Anyway, thanks for your patient answers.

mcfrith commented 3 years ago

How strange. Are you using the latest LAST (lastal --version)? The only other reason I can think of is your computer may not have enough memory (in which case I'd run lastdb with option -uRY32).

Jesson-mark commented 3 years ago

Yes, I'm using the latest LAST. Here is the output of lastal --version:

$ lastal --version
lastal 1256

You are right. The reason LAST ran so slowly is that it needed more memory. When I set -P10 -i30G -k64, lastal ran faster than before: it took about 8 hours to finish the alignment of a 5G fastq file. I'm a little confused about the -i parameter. The doc (https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst) says -i specifies the batch size of the input, so does -i100 mean it processes one hundred sequences at a time?

Thanks for your reply. I will try it again.

mcfrith commented 3 years ago

-i100 means 100 bases at a time (https://gitlab.com/mcfrith/last/-/blob/main/doc/lastal.rst). But it always does at least one whole sequence at a time, so it effectively means one sequence at a time.

With multi-threading, the default is -i8M (undocumented, bad). I would expect increasing -i to not greatly change the run time in this case: I wonder how much faster it got. Maybe this default -i should be increased...

It still seems a bit slower than expected. You could maybe try -P10 -k64 -m2 -R00.
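
Spelled out as a full command, that suggestion would look something like this (reads.fastq is a placeholder for one of the split fastq files; -i is left at its default):

lastal -P10 -k64 -m2 -R00 -p myseq.par mydb reads.fastq | last-split > myseq.maf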

Jesson-mark commented 3 years ago

Yes, it's still slower than expected. I tried -P20 -i100G -k64 -m5 -R00 and the speed is not ideal: it took 54 hours to generate a 23G file, whereas -P20 -i100G -k64 needed only 34 hours to generate 23G. So -m5 -R00 seems to slow things down, which is confusing.

mcfrith commented 3 years ago

I'm out of ideas. It seems impossible for -m5 -R00 to make it slower...

Jesson-mark commented 3 years ago

Thanks for your considerate replies! Maybe it is the -i100G parameter that slows things down: the more sequences it loads at once, the more slowly the result file seems to be written. That's just my guess.

I will try other parameter combinations; if anything changes I will let you know.

Best wishes!

Jesson-mark commented 3 years ago

Hi, I used the parameters -P20 -k64 -m5 -R00 (dropping the -i parameter) and the result file is now generated much faster than before, about 3G per hour. Maybe loading too many sequences into memory at once is what slowed it down.

Now that this problem is solved, I have another question. Does lastal support multiprocessing? Our computing cluster has many nodes, and each node can use up to 28 threads. Can lastal be run on two or more nodes at the same time?

mcfrith commented 3 years ago

To run it on 2 or more nodes: one way is to split the fastq file into parts with something like split -l200000, as mentioned earlier in this issue. Then align each part on a different node.

Alternatively, it should be possible by specifying suitable GNU parallel options to parallel-fastq (https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst). Actually I've never done that, but apparently GNU parallel can do it: https://www.gnu.org/software/parallel/parallel_tutorial.html#remote-execution
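
For example, a rough sketch of the GNU parallel route (untested, and option names can vary between GNU parallel versions; node1 and node2 are placeholder hostnames, and mydb plus myseq.par must already exist on each node):

gzip -cd fq.gz | split -l200000 - chunk-
parallel --sshlogin node1,node2 --transferfile {} --return {}.maf --cleanup \
  'lastal -P28 -k64 -p myseq.par mydb {} | last-split > {}.maf' ::: chunk-*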

Jesson-mark commented 3 years ago

I see. It seems that running this program (or others) on multiple nodes is not commonly done. Thanks for your suggestions. If needed I will try GNU parallel.

Thanks again!

Jesson-mark commented 3 years ago

Hi, I found the reason why lastal runs slowly on my computation server. It is the I/O speed of our computing system that limits how fast the result file is written: when many I/O-heavy tasks are running, lastal writes its result file very slowly. Given the limited I/O speed, is there anything I can do to speed it up? Can the priority of lastal be increased?

mcfrith commented 3 years ago

Aha: yes, I/O trouble could explain it. How about reducing the output size with gzip before writing it out, as mentioned here: https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md

Try to figure out which parts of your file system are on fast/local disks.
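
For instance, a sketch of that suggestion using the parameters from earlier in this thread:

lastal -P20 -k64 -m5 -R00 -p myseq.par mydb reads.fastq | last-split | gzip > myseq.maf.gz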

Jesson-mark commented 3 years ago

Thanks for your reply. I will try it.

Besides, I have another question. I used lastal to align human assemblies to the reference genome, with the parameters -P24 -k64 -m5 -R00. The program has been running for about 26 hours and the result MAF file is only 1.4K; the memory used is about 3.5G. I don't know why it is so slow. Is LAST not suitable for aligning assemblies?

mcfrith commented 2 years ago

It should be suitable for aligning genome assemblies, and again I'm surprised it's not very fast, with those parameters.

One point is that the -P24 multithreading may be ineffective, and it might help to add something like -i3G as mentioned here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-parallel.rst

Jesson-mark commented 2 years ago

Yes, you are right. I finally found the real reason why LAST was not very fast: a lack of disk space on our server was limiting the speed. Now that our disk space is sufficient, LAST generates the MAF file at about 3G per hour. So LAST is still great for processing our data!

Thanks for all the help you have given over the past days! Best wishes.