replikation / What_the_Phage

WtP: Phage identification via nextflow and docker or singularity
https://mult1fractal.github.io/wtp-documentation/
GNU General Public License v3.0

Nanopore data: MARVEL can not handle sequences longer than 65535 Unicode code units #20

Closed: hoelzer closed this issue 4 years ago

hoelzer commented 4 years ago

Nothing important for now, but I was curious and just threw some Nanopore reads into the workflow; MARVEL reported:

Error executing process > 'marvel_wf:marvel (1)'

Caused by:
  Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/groovy/script/ScriptB029065F62D4DB146CF4723657DAAB3D: 1: String too long. The given string is 13054086 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 1, column 16.
   __$$_out.print("""
                  ^

1 error

So it seems a length filter needs to be applied for MARVEL when it is used with long sequences. Are all the input test data sets actually smaller than this size? I mean, it is totally possible that someone just starts the workflow with a FASTA file containing longer sequences.

replikation commented 4 years ago

I think it's the .splitFasta()?? That would be annoying.

replikation commented 4 years ago

I think a samtools module in front of it should fix this easily.

replikation commented 4 years ago

We might need to change the test data set a bit so we can more easily test things in the future. @Stormrider935 maybe we should add like 3 good test data sets: one with lots of small contigs, one with a medium amount of middle/normal sizes, and one where basically only a few genomes are in there.

Because people will drop everything in this tool 🗡

replikation commented 4 years ago

@hoelzer does "String too long" mean too many contigs, or that a contig is too long?

hoelzer commented 4 years ago

@replikation good question, my first impression was that contigs (or in this case, reads) are simply too long. But it could also be that there are too many contigs/reads.

My input FASTA is ~800 MB.

Anyway, without MARVEL the other tools finished:

executor >  lsf (11)
[98/7a3b76] process > virsorter_wf:input_suffix_check (1)       [100%] 1 of 1, cached: 1 ✔
[9f/df06bf] process > virsorter_wf:virsorter (1)                [100%] 1 of 1 ✔
[14/799570] process > virsorter_wf:filter_virsorter (1)         [100%] 1 of 1 ✔
[b6/09fb1a] process > metaphinder_wf:input_suffix_check (1)     [100%] 1 of 1, cached: 1 ✔
[6d/67c4c5] process > metaphinder_wf:metaphinder (1)            [100%] 1 of 1 ✔
[bf/0ccce7] process > metaphinder_wf:filter_metaphinder (1)     [100%] 1 of 1 ✔
[0f/959cf6] process > deepvirfinder_wf:input_suffix_check (1)   [100%] 1 of 1, cached: 1 ✔
[27/e0e1b0] process > deepvirfinder_wf:deepvirfinder (1)        [100%] 1 of 1 ✔
[18/19aba7] process > deepvirfinder_wf:filter_deepvirfinder (1) [100%] 1 of 1 ✔
[e4/6043c0] process > virfinder_wf:input_suffix_check (1)       [100%] 1 of 1, cached: 1 ✔
[02/d913c2] process > virfinder_wf:virfinder (1)                [100%] 1 of 1 ✔
[4e/9a7dae] process > virfinder_wf:filter_virfinder (1)         [100%] 1 of 1 ✔
[ff/c1ae1e] process > pprmeta_wf:input_suffix_check (1)         [100%] 1 of 1, cached: 1 ✔
[b6/d84f86] process > pprmeta_wf:pprmeta (1)                    [100%] 1 of 1 ✔
[88/63175e] process > pprmeta_wf:filter_PPRmeta (1)             [100%] 1 of 1 ✔
[1b/db967e] process > r_plot (1)                                [100%] 1 of 1, failed: 1 ✘
WARN: Input tuple does not match input set cardinality declared by process `r_plot` -- offending value: [SRR8811960_1.unclassified, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/14/79957041d5d5f010b6f91cbc1dd3d6/virsorter.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/bf/0ccce79516c6e5cc0f6ee9c79fa3c1/metaphinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/18/19aba75a38d821c5d00ccace25a2e1/deepvirfinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/4e/9a7daece0c8103633a90b0c1da8cb1/virfinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/88/63175e2b934b78d008e67446792c3e/PPRmeta.txt]

And only the r_plot did not work. But maybe that is because the MARVEL output is missing?

Error executing process > 'r_plot (1)'

Caused by:
  Process `r_plot (1)` terminated with an error exit status (1)

Command executed:

  convert.sh
  heatmap.R summary.csv

Command exit status:
  1

Command output:
  (empty)

Command error:
  sort: cannot create temporary file in '/scratch': No such file or directory
  sort: cannot create temporary file in '/scratch': No such file or directory
  sort: cannot create temporary file in '/scratch': No such file or directory

  Attaching package: ‘dplyr’

  The following object is masked from ‘package:gridExtra’:

      combine

  The following objects are masked from ‘package:stats’:

      filter, lag

  The following objects are masked from ‘package:base’:

      intersect, setdiff, setequal, union

  Attaching package: ‘tidyr’

  The following object is masked from ‘package:reshape2’:

      smiths

  Error in FUN(X[[i]], ...) : object 'variable' not found
  Calls: <Anonymous> ... <Anonymous> -> f -> scales_add_defaults -> lapply -> FUN
  Execution halted

I just had a rough look at the VirSorter output: 220 phage sequences and mostly cat-2.

hoelzer commented 4 years ago

@r_plot error: nope, the problem is not that the MARVEL output is missing; I tested it.

replikation commented 4 years ago

Yes, it's the missing MARVEL output. It would be good to know if it's the contig size or the amount, because the amount would not be an issue: raw reads will get a separate file "entry" (--fastq) plus a samtools step and a FASTQ-to-FASTA conversion. But if it's the size, I'll need to adjust MARVEL accordingly.

hoelzer commented 4 years ago

@missing marvel output: ahh ok, because your R script is not dynamic with respect to its input files? ok.

@contig size or amount: yeah, we have to investigate this. Can't say at the moment without further testing.

replikation commented 4 years ago

@missing: I tried it as a groupTuple() to avoid an explicit input channel (my scripts don't care, for instance; the files only have to be present), but somehow I got strange errors with Nextflow. So it's more a channel-handling issue :D

hoelzer commented 4 years ago

I just tried this input file:

https://github.com/hoelzer/CWL_viral_pipeline/tree/master/CWL/Files_for_test

and ran into this error again:

Error executing process > 'marvel_wf:marvel (1)'

Caused by:
  Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/groovy/script/ScriptE9E90C6982D77ECE92405BEF7E1F77D3: 1: String too long. The given string is 82394 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 1, column 16.
   __$$_out.print("""
                  ^

1 error

Source block:
  """
        rnd=${Math.random()}
        mkdir fasta_dir_${name}
        cp ${fasta} fasta_dir_${name}/
        # Marvel
        marvel_bins.py -i fasta_dir_${name} -t ${params.cpus} > results.txt
        # getting contig names
        filenames=\$(grep  "${name}\\." results.txt | cut -f2 -d " ")
        while IFS= read -r samplename ; do
         head -1 fasta_dir_${name}/\${samplename}.fa >> ${name}_\${rnd//0.}.list
        done < <(printf '%s\n' "\${filenames}")
        """

There are indeed some long contigs:

(base) ➜  What_the_Phage git:(master) grep ">" /homes/mhoelzer/backuped/git/CWL_viral_pipeline/CWL/Files_for_test/ERR575691_host_filtered.fasta | head
>NODE_1_length_576855_cov_17.155733
>NODE_2_length_264378_cov_11.585458
>NODE_3_length_109381_cov_8.571337
>NODE_4_length_86514_cov_9.751212
>NODE_5_length_65088_cov_10.862885
>NODE_6_length_61270_cov_3.963653
>NODE_7_length_53681_cov_7.396207
>NODE_8_length_46689_cov_12.914976
>NODE_9_length_44408_cov_5.457669
>NODE_10_length_41715_cov_27633.072156

As a test, I removed all sequences larger than 100 kb: still the error.

Then I removed all larger than 50 kb. Still.

Then I removed a bunch of short (< ~1.2 kb) sequences, and then it was running.

Next, I used the original file again (with the very long sequences) and just removed all contigs smaller than 500 nt. And it was working. I don't know what the minimum is that MARVEL accepts, but it seems an initial length-filter step should be included in the pipeline.

I then tested removing everything smaller than 100 nt: the error occurs again. Smaller than 150: error. Smaller than 200: error. Smaller than 300: it's running.

By removing all contigs smaller than 300 nt, I reduced the number of contigs from 2593 to 1151.

As another test, I removed the longest contigs until I had a file of only 1578 contigs (including all the short ones), and got some other weird error.

TLDR:

I would suggest implementing a length filter (at least for MARVEL) with a cutoff of 300 nt. What do you think?
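A minimal sketch of such a pre-filter in plain awk (the filename input.fasta and the 300 nt cutoff are illustrative; a seqkit- or samtools-based variant would work just as well; note this also unwraps multi-line records onto one line):

```shell
# Hypothetical pre-filter: keep only records with >= 300 nt of sequence.
# Works on multi-line FASTA; output sequences are unwrapped to one line.
awk -v min=300 '
  /^>/ { if (hdr != "" && length(seq) >= min) print hdr "\n" seq;
         hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
' input.fasta > filtered.fasta
```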

replikation commented 4 years ago

I think I identified the error and what it is talking about, @hoelzer.

grep ">" ERR575691_host_filtered.fasta | wc -c gives me 89773.

So it's talking about the total size of all contig headers. It has nothing to do with the amount (only indirectly: more read names means more characters) or with the sequences.

Error Message: String too long. The given string is 82394 Unicode code units long
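As a quick sanity check, this diagnosis can be applied to any input file before starting the workflow (a sketch; input.fasta is a placeholder name): if the combined header lines exceed the 65,535-character limit quoted in the error, the generated script would likely fail.

```shell
# Sketch: total bytes of all FASTA header lines (roughly what the error
# message counts) versus Groovy's 65,535-character string-literal limit.
total=$(grep ">" input.fasta | wc -c)
if [ "$total" -gt 65535 ]; then
  echo "headers: ${total} chars, over the limit"
else
  echo "headers: ${total} chars, under the limit"
fi
```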

replikation commented 4 years ago

It is not a MARVEL error, btw; it's an error from the .splitFasta statement. The solution would be to use a process instead of the filter.

replikation commented 4 years ago

The workaround, until I rewrite that part, would be to break your file into 2 parts and then put those into the workflow. If splitting the file into 2 parts works, tell me, because then it's most definitely this issue.

replikation commented 4 years ago

It is because MARVEL wants each contig as a separate file, since it works like a tool that analyzes "bins" (a folder with lots of FASTA files). So, out of a FASTA or FASTQ file, we create one file per contig/read and simulate a bin dir. Then we know the information per contig, instead of just "this FASTA contains phages".

Or at least we wrote it this way to be more useful.
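The bin-dir simulation described above could look roughly like this (a sketch; the directory and file names are illustrative, and it assumes a well-formed FASTA that starts with a > header):

```shell
# Sketch: write each FASTA record into its own file inside a directory,
# so MARVEL can treat the directory as a set of single-contig "bins".
mkdir -p fasta_dir
awk '/^>/ { if (out) close(out);   # close the previous contig file
            out = sprintf("fasta_dir/contig_%d.fa", ++n) }
          { print > out }' input.fasta
```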

hoelzer commented 4 years ago

Oh snap. Ok, I see. I will test splitting the file into two chunks and report.

hoelzer commented 4 years ago

@replikation ok, this is a bit odd and I am not sure what is going on.

I split the file into smaller chunks of 500 sequences, actually with this nice awk command:

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%500==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < ERR575691_host_filtered.fasta

Please find here one of the chunked fasta files: myseq1500.fa.zip

That now gives me this error:

Error executing process > 'marvel_wf:marvel (1)'

Caused by:
  Process `marvel_wf:marvel (1)` terminated with an error exit status (1)

Command executed:

  rnd=0.6634756290061703
        mkdir fasta_dir_myseq1500
  cp myseq1500.1.fa myseq1500.2.fa myseq1500.3.fa myseq1500.4.fa myseq1500.5.fa ... myseq1500.496.fa myseq1500.497.fa myseq1500.498.fa myseq1500.499.fa myseq1500.500.fa fasta_dir_myseq1500/
        # Marvel
       marvel_bins.py -i fasta_dir_myseq1500 -t 8 > results.txt
        # getting contig names
        filenames=$(grep  "myseq1500\." results.txt | cut -f2 -d " ")
        while IFS= read -r samplename ; do
         head -1 fasta_dir_myseq1500/${samplename}.fa >> myseq1500_${rnd//0.}.list
        done < <(printf '%s
  ' "${filenames}")

Command exit status:
  1

Command output:
  (empty)

Command error:
  /usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)
  /usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)
  Traceback (most recent call last):
    File "/MARVEL/marvel_bins.py", line 317, in <module>
      y_test_predicted_pickle = pickle_model.predict(array_bins[:, ])
    File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 545, in predict
      proba = self.predict_proba(X)
    File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 588, in predict_proba
      X = self._validate_X_predict(X)
    File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 359, in _validate_X_predict
      return self.estimators_[0]._validate_X_predict(X, check_input=True)
    File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 391, in _validate_X_predict
      X = check_array(X, dtype=DTYPE, accept_sparse="csr")
    File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 521, in check_array
      "if it contains a single sample.".format(array))
  ValueError: Expected 2D array, got 1D array instead:
  array=[].
  Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

hoelzer commented 4 years ago

Btw, same issue on my local machine (MacBook).

hoelzer commented 4 years ago

I think this is related to the way the marvel module is implemented. For larger files (~10k sequences) I get errors (on the LSF cluster and my local machine) like:

Error executing process > 'marvel_wf:marvel (3)'

Caused by:
  Process `marvel_wf:marvel (3)` terminated with an error exit status (1)

Command executed:

  rnd=0.2659904844845017
        mkdir fasta_dir_ERR575691_host_filtered_filt500bp
        cp ERR575691_host_filtered_filt500bp.1.fa ERR575691_host_filtered_filt500bp.2.fa ERR575691_host_filtered_filt500bp.3.fa ERR575691_host_filtered_filt500bp.4.fa ERR575691_host_filtered_filt500bp.5.fa ERR575691_host_filtered_filt500bp.6.fa ERR575691_host_filtered_filt500bp.7.fa ERR575691_host_filtered_filt500bp.8.fa ERR575691_host_filtered_filt500bp.9.fa ERR575691_host_filtered_filt500bp.10.fa ERR575691_host_filtered_filt500bp.11.fa ERR575691_host_filtered_filt500bp.12.fa ERR575691_host_filtered_filt500bp.13.fa ERR575691_host_filtered_filt500bp.14.fa ERR575691_host_filtered_filt500bp.15.fa ERR575691_host_filtered_filt500bp.16.fa ERR575691_host_filtered_filt500bp.17.fa ERR575691_host_filtered_filt500bp.18.fa ERR575691_host_filtered_filt500bp.19.fa ERR575691_host_filtered_filt500bp.20.fa ERR575691_host_filtered_filt500bp.21.fa ERR575691_host_filtered_filt500bp.22.fa ERR575691_host_filtered_filt500bp.23.fa ... ERR575691_host_filtered_filt500bp.491.fa ERR575691_host_filtered_filt500bp.492.fa ERR575691_host_filtered_filt500bp.493.fa ERR575691_host_filtered_filt500bp.494.fa ERR575691_host_filtered_filt500bp.495.fa ERR575691_host_filtered_filt500bp.496.fa ERR575691_host_filtered_filt500bp.497.fa ERR575691_host_filtered_filt500bp.498.fa ERR575691_host_filtered_filt500bp.499.fa ERR575691_host_filtered_filt500bp.500.fa ERR575691_host_filtered_filt500bp.501.fa ERR575691_host_filtered_filt500bp.502.fa ERR575691_host_filtered_filt500bp.503.fa fasta_dir_ERR575691_host_filtered_filt500bp/
        # Marvel
        marvel_bins.py -i fasta_dir_ERR575691_host_filtered_filt500bp -t 8 > results.txt
        # getting contig names
        filenames=$(grep  "ERR575691_host_filtered_filt500bp\." results.txt | cut -f2 -d " ")
        while IFS= read -r samplename ; do
         head -1 fasta_dir_ERR575691_host_filtered_filt500bp/${samplename}.fa >> ERR575691_host_filtered_filt500bp_${rnd//0.}.list
        done < <(printf '%s
  ' "${filenames}")

Command exit status:
  1

Command output:
  (empty)

Command error:
  /usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)
  /usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
    UserWarning)
  head: cannot open 'fasta_dir_ERR575691_host_filtered_filt500bp/has.fa' for reading: No such file or directory

replikation commented 4 years ago

@hoelzer we identified the bug; it was just an issue with files that are 100% negative, and we fixed it. WtP now runs completely with files that don't contain any phages.