kishwarshafin / helen

H.E.L.E.N. (Homopolymer Encoded Long-read Error-corrector for Nanopore)
MIT License

margin docker run fail #17

Closed lstxmu closed 4 years ago

lstxmu commented 4 years ago

Hi, I ran the MarginPolish process (Docker version) and it failed.

root@ecs-9875:/media/datarun/blnanodata/data# tail marginPolish.log
/usr/bin/time -f '\nDEBUG_MAX_MEM:%M\nDEBUG_RUNTIME:%E\n' /opt/MarginPolish/build/marginPolish reads_2_assembly.bam new.fasta allParams.np.human.guppy-ff-233.json -t 32 -o output/marginpolish_images -f

Running OpenMP with 32 threads.

Parsing model parameters from file: allParams.np.human.guppy-ff-233.json
Calloc failed with request for -2 lots of 16 bytes
Command exited with non-zero status 1

DEBUG_MAX_MEM:3836
DEBUG_RUNTIME:0:00.00

Can you help me fix it?

kishwarshafin commented 4 years ago

Hello @lstxmu

The model file that you downloaded, allParams.np.human.guppy-ff-233.json, is corrupted. Can you please remove that file and download it this way:

wget https://raw.githubusercontent.com/UCSC-nanopore-cgl/MarginPolish/master/params/allParams.np.human.guppy-ff-235.json

This downloads the raw JSON file and makes sure you don't download HTML content instead.

Please run the same command with the newly downloaded model and it should work.
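A quick way to confirm that a downloaded parameter file is JSON rather than a saved HTML page is sketched below. This is only an illustrative check, assuming the params file is strict JSON; `check_params_file` is a hypothetical helper, not part of MarginPolish or HELEN.

```python
import json
from pathlib import Path


def check_params_file(path):
    """Return (ok, message) saying whether `path` looks like a JSON params file.

    Hypothetical helper for illustration; assumes the file is strict JSON.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    head = text.lstrip()[:200].lower()
    # A web page saved by mistake usually starts with a doctype or <html> tag.
    if head.startswith("<!doctype") or head.startswith("<html"):
        return False, "file looks like HTML, not JSON"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg} at line {exc.lineno}"
    return True, "file parses as JSON"


if __name__ == "__main__":
    import sys

    # Usage: python check_params.py <path-to-params.json>
    for name in sys.argv[1:]:
        ok, message = check_params_file(name)
        print(f"{name}: {message}")
```

Running it on the file before starting a long polishing job would catch the HTML-download case early.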

lstxmu commented 4 years ago

Hello @kishwarshafin, thanks for your suggestion, it works. But about 2 hours later I got another error message:

root@ecs-25a7:/media/datarun/data# tail marginPolish.log
major: Invalid arguments to routine
minor: Inappropriate type

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139675447654144:

000: ../../../src/H5D.c line 391 in H5Dclose(): not a dataset

major: Invalid arguments to routine
minor: Inappropriate type

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139675447654144:

000: ../../../src/H5G.c line 777 in H5Gclose(): not a group

major: Invalid arguments to routine
minor: Inappropriate type

Can you tell me what happened?

kishwarshafin commented 4 years ago

Hi @lstxmu,

We have seen and solved this error before here. If you can run the command with sudo, it should work.

Alternatively, please run the simple non-sudo walkthrough and then adapt your commands the same way; that should work too. The walkthrough uses E. coli data and should take about 15-20 minutes on a 40-CPU machine.

lstxmu commented 4 years ago

Hi kishwarshafin, thanks for your advice, the issue is fixed. But I found a new issue: the quality of the Shasta assembly (after polishing with MarginPolish and HELEN) was worse than that of wtdbg2. I evaluated the quality with BUSCO:

<shasta+marginpolish+helen> C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915
<wtdbg2+minimap2+pilon> C:94.8%[S:93.7%,D:1.1%],F:3.3%,M:1.9%,n:4915

Have you seen the same issue? Looking forward to your reply.

Best, Luo

kishwarshafin commented 4 years ago

Hi Luo,

What sample/species is this? Which guppy version are you running?

Can you run BUSCO on the unpolished Shasta assembly to see how it performs? Also, for wtdbg2+minimap2, did you mean racon? I believe minimap2 can’t polish.

lstxmu commented 4 years ago

Hi kishwarshafin,

1. The species I assembled is a bird.
2. I don't know the guppy version; I installed the software following your GitHub instructions.
3. I did not run BUSCO on the unpolished assembly. I will run it and tell you the result in a few hours.
4. I used pilon to polish the wtdbg2 result.
5. I used chicken as the reference species for Augustus (run_BUSCO.py -i Assembly.fasta -c 144 -l /media/database/ncbidb/busco/aves_odb9 -m genome --out shastaraw -t /media/datarun3/temp/ -sp chicken)

kishwarshafin commented 4 years ago

I was wondering about the sequencing protocol: which basecaller version was used to basecall the raw reads, and is all the data from ONT? Also, a comparison of the raw/unpolished Shasta and wtdbg2 assemblies would help answer these questions.

lstxmu commented 4 years ago

Hi kishwarshafin,

1. I got the sequencing data from Novogene in China, and they used guppy.
2. BUSCO results:

shasta: C:35.6%[S:35.4%,D:0.2%],F:6.2%,M:58.2%,n:4915
shasta+marginpolish: C:61.2%[S:60.5%,D:0.7%],F:13.0%,M:25.8%,n:4915
shasta+marginpolish+helen: C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915
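For comparing BUSCO summaries like these side by side, a small parser can help. The following is a minimal sketch, assuming the one-line summary format shown in this thread; `parse_busco` and `BUSCO_RE` are hypothetical names and not part of BUSCO itself.

```python
import re

# Pattern for the one-line BUSCO summary used in this thread, e.g.
# C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a BUSCO one-line summary into a dict (hypothetical helper)."""
    match = BUSCO_RE.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


if __name__ == "__main__":
    # Values copied from the comment above.
    results = {
        "shasta": "C:35.6%[S:35.4%,D:0.2%],F:6.2%,M:58.2%,n:4915",
        "shasta+marginpolish": "C:61.2%[S:60.5%,D:0.7%],F:13.0%,M:25.8%,n:4915",
        "shasta+marginpolish+helen": "C:69.4%[S:68.6%,D:0.8%],F:12.7%,M:17.9%,n:4915",
    }
    baseline = parse_busco(results["shasta"])["C"]
    for name, line in results.items():
        score = parse_busco(line)
        change = score["C"] - baseline
        print(f"{name:28s} complete={score['C']:5.1f}%  change vs raw={change:+.1f}")
```

The dictionary keys follow BUSCO's own letters: C (complete), S (single-copy), D (duplicated), F (fragmented), M (missing), and n (total BUSCO groups searched).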

kishwarshafin commented 4 years ago

Hello,

Do you have the raw wtdbg2 BUSCO numbers? You can also polish the wtdbg2 assembly with MP and HELEN to see whether it improves.

I think the issue could be the average read length. Do you happen to know the read N50 or have a plot of the read length distribution?
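For reference, read N50 is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of that calculation follows, assuming the read lengths have already been collected into a list; `read_n50` is a hypothetical helper, not taken from any of the tools discussed here.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy example: total is 20 bases, so half is 10.
    # Longest first: 6, then 6 + 5 = 11 >= 10, so the N50 is 5.
    print(read_n50([2, 3, 4, 5, 6]))
```

Because N50 is weighted by bases rather than by read count, it is usually higher than the median read length for the same data set.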

lstxmu commented 4 years ago

Hi kishwarshafin,

1. The raw wtdbg2 BUSCO: C:94.8%[S:93.9%,D:0.9%],F:3.3%,M:1.9%,n:4915
2. Good suggestion, I will try it.
3. The read length statistics are as follows:

General summary:
Active channels: 2,678.0
Mean read length: 20,802.0
Mean read quality: 7.9
Median read length: 20,648.0
Median read quality: 8.6
Number of reads: 1,719,938.0
Read length N50: 27,841.0
Total bases: 35,778,191,178.0

Number, percentage and megabases of reads above quality cutoffs:
Q5: 1460902 (84.9%) 33972.4Mb
Q7: 1243230 (72.3%) 29449.9Mb
Q10: 249370 (14.5%) 5954.4Mb
Q12: 179 (0.0%) 0.8Mb
Q15: 0 (0.0%) 0.0Mb

Top 5 highest mean basecall quality scores and their read lengths:
1: 13.7 (212)
2: 13.4 (290)
3: 13.4 (300)
4: 13.4 (260)
5: 13.3 (357)

Top 5 longest reads and their mean basecall quality score:
1: 1292137 (4.3)
2: 513973 (4.2)
3: 438576 (4.1)
4: 346715 (4.0)
5: 257072 (3.0)

Attached plots: LengthvsQualityScatterPlot_dot, Weighted_LogTransformed_HistogramReadlength

kishwarshafin commented 4 years ago

Hi Luo,

We had a brief discussion in our group about your findings. We want to debug this issue with your help so you can get a proper answer.

We think this might be a coverage issue. As Shasta applies a strict read-length cutoff, you may lose coverage if your reads are on the shorter side. You can get coverage information from one of these files in the assembly directory: AssemblySummary.html, ReadLengthHistogram.csv, Binned-ReadLengthHistogram.csv, and also from the log output (stdout).

Is there any way you can share these files with us so we can further debug and help you with this issue?

lstxmu commented 4 years ago

Hi kishwarshafin, thanks for your reply. I will check whether the coverage information files are still on the server (otherwise I will rerun the assembly); please give me some time. If I can get these files, I will share them with you and contact you as soon as possible. Best, Luo

kishwarshafin commented 4 years ago

Luo, thank you and take your time. Also, if you get time please run MP+HELEN on the wtdbg2 assembly to make sure it’s not a polishing issue we are seeing here. Thanks a ton for reporting on this.

lstxmu commented 4 years ago

Hi kishwarshafin, I checked the raw assembly directory and got the report files you asked for (except the log output). I have compressed them into shastareport.zip. If you need the log file, I will need some time to rerun the assembly.
Best, Luo
Attachments: Selection_221, shastareport.zip

kishwarshafin commented 4 years ago

Hi Luo,

I'm copying over a comment regarding your run of Shasta. Please let us know if we can help in any way:

There is no coverage issue as Shasta is seeing 76 Gb of coverage and this genome is a bit above 1 Gb, so we are around 70x coverage. I suggest that they run Quast to obtain an estimate of sequence quality over the entire genome. And they should do a comparison of pre-polished quality, otherwise, it is impossible to tell if the accuracy issue is due to the assembly or the polishing. The pre-polished analysis should not use Busco as we know that pre-polished accuracy, for all assemblers, is generally not sufficient to make the Busco analysis meaningful.
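The coverage estimate in the quoted comment is simply total sequenced bases divided by genome size. A minimal sketch of the arithmetic follows; the 76 Gb figure comes from the comment, while the 1.1 Gb genome size is an assumed placeholder for "a bit above 1 Gb", and `coverage` is a hypothetical helper.

```python
def coverage(total_bases, genome_size):
    """Return the expected depth of coverage (total bases / genome size)."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    total_bases = 76e9  # 76 Gb, as quoted in the comment above
    # Assumed placeholder genome sizes; the thread only says "a bit above 1 Gb".
    for genome_size in (1.0e9, 1.1e9, 1.2e9):
        depth = coverage(total_bases, genome_size)
        print(f"genome {genome_size / 1e9:.1f} Gb -> about {depth:.0f}x")
```

Across that assumed range the depth stays well above 60x, which is in line with the "around 70x" figure in the comment.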

kishwarshafin commented 4 years ago

I'd greatly appreciate if you can polish the wtdbg2 assembly with MP+HELEN and give us the results, that'd clarify if there's anything wrong with the polishing pipeline.

lstxmu commented 4 years ago

Hi kishwarshafin, I have just finished polishing the wtdbg2 results with MP+HELEN. The BUSCO run on the polished assembly is in progress. I will show you the values as soon as it is done. Please give me some time. Best, Luo

lstxmu commented 4 years ago

Hi kishwarshafin, the BUSCO results for wtdbg2 polished with MP+HELEN are as follows:

MP: C:63.8%[S:63.3%,D:0.5%],F:12.9%,M:23.3%,n:4915
helen: C:71.4%[S:70.6%,D:0.8%],F:12.1%,M:16.5%,n:4915

kishwarshafin commented 4 years ago

@lstxmu ,

This is very surprising. BUSCO analysis, in this case, seems to be very specific. I'm not exactly sure if it truly represents the sequence quality though. I'd suggest running Quast to get a better idea of what exactly the sequence qualities look like.

lstxmu commented 4 years ago

Hi kishwarshafin, thanks for your advice. I'll run Quast to compare these assembly files. This will take a few days; I will share the results with you once it is done.

kishwarshafin commented 4 years ago

Thanks @lstxmu , will wait until you are done with all the analysis.

lstxmu commented 4 years ago

Hi kishwarshafin, I have run Quast on wtdbg2, wtdbg2+margin, and wtdbg2+margin+helen. Please see the attached zip file.
quastreport.zip

kishwarshafin commented 4 years ago

Thanks, @lstxmu. After seeing the result, I wonder if the reference you are using is suitable for the species you sequenced. The genome fraction is only 0.33. If you have sequenced and assembled everything correctly, you should see a very high match between the reference and the assembly. If this is a non-model organism, I think you don't have the right reference. I can't tell specifically though, as I don't know which reference you are using or which species you have sequenced.

lstxmu commented 4 years ago

Hi @kishwarshafin, the reference genome was Gallus gallus v6.0 (latest version, https://www.ncbi.nlm.nih.gov/assembly/GCF_000002315.6). To be honest, Gallus is not a close relative of my sample (Egretta garzetta), but most avian genome assemblies are not good enough, and Gallus is the model organism in avian studies (and also the most intensively studied species).

kishwarshafin commented 4 years ago

Hi @lstxmu ,

I am not an avian expert so really can't help you with this. But it looks like the tools are working fine. If you have any other issues with running the tools, please let us know. I'm closing this issue.

Thanks.

lstxmu commented 4 years ago

Hi @kishwarshafin, thanks for your help. I still have a question: how many species have you tested with shasta+marginpolish+helen? Can you share some details with me?

kishwarshafin commented 4 years ago

As we reported in the paper, we extensively tested on the human genome. Outside the human genome, different labs have tried Shasta on fish and plant genomes and got satisfying results. Most of these were reported on Twitter.