marbl / merqury

k-mer based assembly evaluation
Other
280 stars 19 forks source link

Error in read.table(dat, header = F) : no lines available in input #109

Closed scthree closed 9 months ago

scthree commented 1 year ago

Dear Developers,

Thank you for this wonderful tool and your contributions in leading the community!

Forgive me if I'm asking a stupid question. I recently used merqury to eval several assemblies generated with hifiasm (HiFi + ONT, then either phased with Trio-binning OR HiC). Was able to run merqury and generate all the asm stats/plots.

Then just made a Verkko assembly (phased with Trio-binning) and ran merqury on the diploid assembly (assembly.fasta) and hap1+hap2 (assembly.haplotype1.fasta, assembly.haplotype2.fasta). No issues with qv, completeness, and generating those plots/stats...but get an Rscript error when it's trying to generate the phased_block.stats - please see error log below.

Could you please let me know what I'm doing wrong? I know you're busy and appreciate your time!

Sincerely, Steve


Haplotype dbs provided. Running Merqury in trio mode...

hap1: mat_uncompress.k21.hapmer.meryl hap2: pat_uncompress.k21.hapmer.meryl asm1: assembly.fasta out : dip

Get spectra-cn plots and QV stats Get blob plots Get haplotype specfic spectra-cn plots Get phase blocks Get block N plots

Generate assembly.fasta.fai

# Found assembly.gaps.bed

Found 246. Generating stats for both scaffolds and contigs.

Convert dip.assembly.100_20000.phased_block.bed to sizes

Result saved as dip.assembly.100_20000.phased_block.sizes

Plot dip.assembly.100_20000.phased_block.bed

Rscript /Programs/merqury-1.3/plot/plot_block_N.R -b dip.assembly.100_20000.phased_block.sizes -c dip.assembly.contig.sizes -s dip.assembly.scaff.sizes -o dip.assembly Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

No block2 found. Done!

read: dtr_uncompress.k21.meryl

Haplotype dbs provided. Running Merqury in trio mode...

hap1: mat_uncompress.k21.hapmer.meryl hap2: pat_uncompress.k21.hapmer.meryl asm1: assembly.haplotype1.fasta asm2: assembly.haplotype2.fasta out : hap

Get spectra-cn plots and QV stats Get blob plots Get haplotype specfic spectra-cn plots Get phase blocks Get block N plots

Generate assembly.haplotype1.fasta.fai

# Found assembly.haplotype1.gaps.bed

Found 85. Generating stats for both scaffolds and contigs.

Generate assembly.haplotype2.fasta.fai

# Found assembly.haplotype2.gaps.bed

Found 161. Generating stats for both scaffolds and contigs.

Convert hap.assembly.haplotype1.100_20000.phased_block.bed to sizes

Result saved as hap.assembly.haplotype1.100_20000.phased_block.sizes

Plot hap.assembly.haplotype1.100_20000.phased_block.bed

Rscript /Programs/merqury-1.3/plot/plot_block_N.R -b hap.assembly.haplotype1.100_20000.phased_block.sizes -c hap.assembly.haplotype1.contig.sizes -s hap.assembly.haplotype1.scaff.sizes -o hap.assembly.haplotype1 Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

Convert hap.assembly.haplotype2.100_20000.phased_block.bed to sizes

Result saved as hap.assembly.haplotype2.100_20000.phased_block.sizes

Plot hap.assembly.haplotype2.100_20000.phased_block.bed

Rscript /Programs/merqury-1.3/plot/plot_block_N.R -b hap.assembly.haplotype2.100_20000.phased_block.sizes -c hap.assembly.haplotype2.contig.sizes -s hap.assembly.haplotype2.scaff.sizes -o hap.assembly.haplotype2 Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

arangrhie commented 1 year ago

Hello,

There was a serious issue in meryl v1.4 release, making meryl-lookup crash. I wonder this was also affected by this issue. If possible, could you obtain meryl v1.4.1 release, and re-run merqury?

If you'd like to re-run, you could you remove .sort.bed and .hap.wig, then re-run $MERQURY/trio/phase_block.sh. This has to be done for each assembly (so twice). Next, run $MERQURY/trio/block_n_stats.sh afterwards (also, twice). Since you are re-running it manually, you could also provide a genome size that could go with block_n_stats.sh to generate NG plots instead of N plots.

Best, Arang

Yichen-HNU commented 10 months ago

Hello, I think we have similar errors, but the difference is that I am running on version 1.4.1. I encountered similar errors during installation using both git clone and meryl-1.4.1.Linux-amd64.tar.xz. In the detailed documentation below, I executed the same commands by providing the reference sequence and parent-offspring databases, along with the two assembled haploid fa files.

Both the quality values (qv) and the generation of plots seem to be working fine, but I am unable to obtain the 100_20000.phased_block.stats and switch error. I cannot observe the completeness of the assembly. Based on the error messages, it seems to suggest that bedtools, samtools, and R are not present. However, I have confirmed that these tools exist in the current conda environment through the executed commands:

samtools Program: samtools (Tools for alignments in the SAM format) Version: 1.18 (using htslib 1.17) R --version R version 4.1.1 (2021-08-10) -- "Kick Things" Copyright (C) 2021 The R Foundation for Statistical Computing Platform: x86_64-conda-linux-gnu (64-bit) bedtools bedtools is a powerful toolset for genome arithmetic. Version: v2.31.1

Haplotype dbs provided. Running Merqury in trio mode...

hap1: chr21-db_M.meryl hap2: chr21-db_F.meryl asm1: hic_pred_hap1.fa asm2: hic_pred_hap2.fa out : hic

Get spectra-cn plots and QV stats

Get blob plots

Get haplotype specfic spectra-cn plots

Get phase blocks

Get block N plots ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'bedtools' ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'samtools' ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'R'

Generate hic_pred_hap1.fa.fasta.fai

[E::fai_build3_core] Failed to open the file hic_pred_hap1.fa.fasta [faidx] Could not build fai index hic_pred_hap1.fa.fasta.fai

# Found hic_pred_hap1.fa.gaps.bed

No gaps found. This is a contig set.

awk: fatal: cannot open file `hic_pred_hap1.fa.fasta.fai' for reading (No such file or directory)

Generate hic_pred_hap2.fa.fasta.fai

[E::fai_build3_core] Failed to open the file hic_pred_hap2.fa.fasta [faidx] Could not build fai index hic_pred_hap2.fa.fasta.fai

# Found hic_pred_hap2.fa.gaps.bed

No gaps found. This is a contig set.

awk: fatal: cannot open file `hic_pred_hap2.fa.fasta.fai' for reading (No such file or directory)

Convert hic.hic_pred_hap1.fa.100_20000.phased_block.bed to sizes

Result saved as hic.hic_pred_hap1.fa.100_20000.phased_block.sizes

Plot hic.hic_pred_hap1.fa.100_20000.phased_block.bed

Rscript /home/lyc/.conda/envs/porec/share/merqury/plot/plot_block_N.R -b hic.hic_pred_hap1.fa.100_20000.phased_block.sizes -c hic.hic_pred_hap1.fa.contig.sizes -o hic.hic_pred_hap1.fa Loading required package: argparse Warning message: package ‘argparse’ was built under R version 4.1.3 Loading required package: ggplot2 Warning message: package ‘ggplot2’ was built under R version 4.1.3 Loading required package: scales Warning message: package ‘scales’ was built under R version 4.1.3 Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

Convert hic.hic_pred_hap2.fa.100_20000.phased_block.bed to sizes

Result saved as hic.hic_pred_hap2.fa.100_20000.phased_block.sizes

Plot hic.hic_pred_hap2.fa.100_20000.phased_block.bed

Rscript /home/lyc/.conda/envs/porec/share/merqury/plot/plot_block_N.R -b hic.hic_pred_hap2.fa.100_20000.phased_block.sizes -c hic.hic_pred_hap2.fa.contig.sizes -o hic.hic_pred_hap2.fa Loading required package: argparse Warning message: package ‘argparse’ was built under R version 4.1.3 Loading required package: ggplot2 Warning message: package ‘ggplot2’ was built under R version 4.1.3 Loading required package: scales Warning message: package ‘scales’ was built under R version 4.1.3 Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

output

100_20000.phased_block: nothing 100_20000.switch.bed: nothing

I sincerely hope that you are familiar with this issue and can assist me. Thank you very much for your help!

arangrhie commented 10 months ago

Hello @qq923781272 ,

dafc7e31955ed7a1b6e4d89cfee93e1881759c6a will fix your crash. Also I see you don't need to load the modules. You could uncomment this line to avoid the ModuleCmd_Load ERRORs.

I'd suggest to clone the latest Merqury git from the master branch (nothing to install), and re run Merqury. Hope this works for you!

Best, Arang

Yichen-HNU commented 10 months ago

I apologize for bothering you again. I have checked whether my Merqury version matches the branch correction you sent me. I am certain it is the latest version, but I am still encountering similar issues. Just to be safe, I cloned a fresh copy of the code, but the same error persists.

What's even more puzzling is that I successfully generated the phased_block and 100_20000.switches.txt in a previous run yesterday. However, today, I have been unable to reproduce the success under any circumstances.

I followed your instructions in the previous response to remove and execute $MERQURY/trio/phase_block.sh and $MERQURY/trio/block_n_stats.sh twice each. However, during this process, errors occurred, preventing me from obtaining results for switch and phased_block.

Do you have any insights into why this might be happening? The command I used is $MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta out_prefix.

Here is the error message I encountered: Haplotype dbs provided. Running Merqury in trio mode...

hap1: chr21-db_M.meryl hap2: chr21-db_F.meryl asm1: hic_pred_hap1.fa asm2: hic_pred_hap2.fa out : hic

Get spectra-cn plots and QV stats

Get blob plots

Get haplotype specfic spectra-cn plots

Get phase blocks

Get block N plots ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'bedtools' ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'samtools' ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'R'

Generate hic_pred_hap1.fa.fasta.fai

[E::fai_build3_core] Failed to open the file hic_pred_hap1.fa.fasta [faidx] Could not build fai index hic_pred_hap1.fa.fasta.fai

# Found hic_pred_hap1.fa.gaps.bed

No gaps found. This is a contig set.

awk: fatal: cannot open file `hic_pred_hap1.fa.fasta.fai' for reading (No such file or directory)

Generate hic_pred_hap2.fa.fasta.fai

[E::fai_build3_core] Failed to open the file hic_pred_hap2.fa.fasta [faidx] Could not build fai index hic_pred_hap2.fa.fasta.fai

# Found hic_pred_hap2.fa.gaps.bed

No gaps found. This is a contig set.

awk: fatal: cannot open file `hic_pred_hap2.fa.fasta.fai' for reading (No such file or directory)

Convert hic.hic_pred_hap1.fa.100_20000.phased_block.bed to sizes

Result saved as hic.hic_pred_hap1.fa.100_20000.phased_block.sizes

Plot hic.hic_pred_hap1.fa.100_20000.phased_block.bed

Rscript /home/lyc/.conda/envs/porec/share/merqury/plot/plot_block_N.R -b hic.hic_pred_hap1.fa.100_20000.phased_block.sizes -c hic.hic_pred_hap1.fa.contig.sizes -o hic.hic_pred_hap1.fa Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

Convert hic.hic_pred_hap2.fa.100_20000.phased_block.bed to sizes

Result saved as hic.hic_pred_hap2.fa.100_20000.phased_block.sizes

Plot hic.hic_pred_hap2.fa.100_20000.phased_block.bed

Rscript /home/lyc/.conda/envs/porec/share/merqury/plot/plot_block_N.R -b hic.hic_pred_hap2.fa.100_20000.phased_block.sizes -c hic.hic_pred_hap2.fa.contig.sizes -o hic.hic_pred_hap2.fa Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in read.table(dat, header = F) : no lines available in input Calls: block_n -> attach_n -> read.table Execution halted

$MERQURY/trio/phase_block.sh

Found 1 command tree.

PROCESSING TREE #1 using 128 threads. opLessThan chr21-db_M.meryl print to (stdout) Detected k-mer size: 21 [E::fai_build3_core] Failed to open the file hic_pred_hap1.fa.fasta [faidx] Could not build fai index hic_pred_hap1.fa.fasta.fai

    Generate haplotype marker sites bed

            -- chr21-db_M

usage: meryl-lookup \ -sequence [] \ -output [] \ -mers [] [...] [-estimate] \ -labels [] [...]

Compare kmers in input sequences against kmers in input meryl databases.

Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or compressed with gzip, xz, or bzip2.

Report types:

-bed: Generate a BED format file showing the location of kmers in any input database on each sequence in 'input1.fasta'.

-wig-count: Generate a WIGGLE format file showing the multiplicity of the kmer starting at each position in the sequence, if it exists in an input kmer database.

-wig-depth: Generate a WIGGLE format file showing the number of kmers in any input database that cover each position in the sequence.

-existence: Generate a tab-delimited line for each input sequence with the number of kmers in the sequence, in the database and common to both.

-include: -exclude: Copy sequences from 'input1.fasta' (and 'input2.fasta') to the corresponding output file if the sequence has at least one kmer present (include) or no kmers present (exclude) in 'input1.meryl'.

Run meryl-lookup <report-type> -help for details on each method.

Unknown option '-dump'. No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied. Using system JDK. INFO [Nov 29,2023 18:47] [Globals] Development mode is enabled SEVERE [Nov 29,2023 18:47] [IgvTools] Genome definition file not found for: hic_pred_hap1.fa.fasta.fai SEVERE [Nov 29,2023 18:47] [IgvTools] org.broad.igv.tools.PreprocessingException: Genome definition file not found for: hic_pred_hap1.fa.fasta.fai at org.igv/org.broad.igv.tools.IgvTools.loadGenome(IgvTools.java:1156) at org.igv/org.broad.igv.tools.IgvTools.doCount(IgvTools.java:826) at org.igv/org.broad.igv.tools.IgvTools.run(IgvTools.java:371) at org.igv/org.broad.igv.tools.IgvTools.main(IgvTools.java:254)

            -- chr21-db_F

usage: meryl-lookup \ -sequence [] \ -output [] \ -mers [] [...] [-estimate] \ -labels [] [...]

Compare kmers in input sequences against kmers in input meryl databases.

Input sequences (-sequence) can be FASTA or FASTQ, uncompressed, or compressed with gzip, xz, or bzip2.

Report types:

-bed: Generate a BED format file showing the location of kmers in any input database on each sequence in 'input1.fasta'.

-wig-count: Generate a WIGGLE format file showing the multiplicity of the kmer starting at each position in the sequence, if it exists in an input kmer database.

-wig-depth: Generate a WIGGLE format file showing the number of kmers in any input database that cover each position in the sequence.

-existence: Generate a tab-delimited line for each input sequence with the number of kmers in the sequence, in the database and common to both.

-include: -exclude: Copy sequences from 'input1.fasta' (and 'input2.fasta') to the corresponding output file if the sequence has at least one kmer present (include) or no kmers present (exclude) in 'input1.meryl'.

Run meryl-lookup <report-type> -help for details on each method.

Unknown option '-dump'. No report-type (-bed, -wig-count, -wig-depth, -existence, -include, -exclude) supplied. Using system JDK. INFO [Nov 29,2023 18:47] [Globals] Development mode is enabled SEVERE [Nov 29,2023 18:47] [IgvTools] Genome definition file not found for: hic_pred_hap1.fa.fasta.fai SEVERE [Nov 29,2023 18:47] [IgvTools] org.broad.igv.tools.PreprocessingException: Genome definition file not found for: hic_pred_hap1.fa.fasta.fai at org.igv/org.broad.igv.tools.IgvTools.loadGenome(IgvTools.java:1156) at org.igv/org.broad.igv.tools.IgvTools.doCount(IgvTools.java:826) at org.igv/org.broad.igv.tools.IgvTools.run(IgvTools.java:371) at org.igv/org.broad.igv.tools.IgvTools.main(IgvTools.java:254)

    Sort hic.bed

/home/lyc/.conda/envs/porec/share/merqury/trio/switch_error.sh hic.sort.bed hic 100 20000

    java -jar -Xmx1g /home/lyc/.conda/envs/porec/share/merqury/trio/bedMerToPhaseBlock.jar hic.sort.bed hic.100_20000 100 20000 

Processing file hic.sort.bed Found and as haplotypes. Running time : 0 h 0 m 0 sec

awk: cmd. line:1: fatal: division by zero attempted

java -jar -Xmx1g /home/lyc/.conda/envs/porec/share/merqury/eval/bedCalcN50.jar hic.100_20000.phased_block.bed | tail -n1 | awk -v out=hic.100_20000 -v swi="" '{print out"\t"$0"\tswi}' - >> hic.100_20000.phased_block.stats Processing file hic.100_20000.phased_block.bed Num. blocks: 0 Genome covered bases: 0 Exception in thread "main" java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0 at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266) at java.base/java.util.Objects.checkIndex(Objects.java:359) at java.base/java.util.ArrayList.get(ArrayList.java:427) at javax.arang.bed.CalcN50.hooker(CalcN50.java:41) at javax.arang.IO.Rwrapper.go(Rwrapper.java:14) at javax.arang.bed.CalcN50.main(CalcN50.java:66)

Count and hap-mers per block to hic.100_20000.phased_block.counts ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'R'

Rscript /home/lyc/.conda/envs/porec/share/merqury/plot/plot_blob.R -f hic.100_20000.phased_block.counts -o hic.100_20000.phased_block.blob Loading required package: argparse Loading required package: ggplot2 Loading required package: scales Error in [.data.frame(dat, , 4) : undefined columns selected Calls: blob_plot -> [ -> [.data.frame In addition: Warning message: In max(dat[, 3]) : no non-missing arguments to max; returning -Inf Execution halted

Thank you once again for your response, and I apologize for any inconvenience caused!