Closed: hoelzer closed this issue 4 years ago
I think it's the .splitFasta()
?? That would be annoying.
I think a samtools module in front of it should fix this easily.
We might need to change the test data set a bit so we can test things more easily in the future. @Stormrider935 maybe we should add three good test data sets: one with lots of small contigs, one with a medium number of normal-sized contigs, and one where basically a few whole genomes are in there.
Because people will drop everything into this tool 🗡
@hoelzer does "String too long" mean too many contigs, or that a contig is too long?
@replikation good question, my first impression was that the contigs (or, in this case, reads) are simply too long. But it could also be that there are too many contigs/reads.
My input FASTA is ~800 MB.
Anyway, without MARVEL the other tools finished:
executor > lsf (11)
[98/7a3b76] process > virsorter_wf:input_suffix_check (1) [100%] 1 of 1, cached: 1 ✔
[9f/df06bf] process > virsorter_wf:virsorter (1) [100%] 1 of 1 ✔
[14/799570] process > virsorter_wf:filter_virsorter (1) [100%] 1 of 1 ✔
[b6/09fb1a] process > metaphinder_wf:input_suffix_check (1) [100%] 1 of 1, cached: 1 ✔
[6d/67c4c5] process > metaphinder_wf:metaphinder (1) [100%] 1 of 1 ✔
[bf/0ccce7] process > metaphinder_wf:filter_metaphinder (1) [100%] 1 of 1 ✔
[0f/959cf6] process > deepvirfinder_wf:input_suffix_check (1) [100%] 1 of 1, cached: 1 ✔
[27/e0e1b0] process > deepvirfinder_wf:deepvirfinder (1) [100%] 1 of 1 ✔
[18/19aba7] process > deepvirfinder_wf:filter_deepvirfinder (1) [100%] 1 of 1 ✔
[e4/6043c0] process > virfinder_wf:input_suffix_check (1) [100%] 1 of 1, cached: 1 ✔
[02/d913c2] process > virfinder_wf:virfinder (1) [100%] 1 of 1 ✔
[4e/9a7dae] process > virfinder_wf:filter_virfinder (1) [100%] 1 of 1 ✔
[ff/c1ae1e] process > pprmeta_wf:input_suffix_check (1) [100%] 1 of 1, cached: 1 ✔
[b6/d84f86] process > pprmeta_wf:pprmeta (1) [100%] 1 of 1 ✔
[88/63175e] process > pprmeta_wf:filter_PPRmeta (1) [100%] 1 of 1 ✔
[1b/db967e] process > r_plot (1) [100%] 1 of 1, failed: 1 ✘
WARN: Input tuple does not match input set cardinality declared by process `r_plot` -- offending value: [SRR8811960_1.unclassified, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/14/79957041d5d5f010b6f91cbc1dd3d6/virsorter.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/bf/0ccce79516c6e5cc0f6ee9c79fa3c1/metaphinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/18/19aba75a38d821c5d00ccace25a2e1/deepvirfinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/4e/9a7daece0c8103633a90b0c1da8cb1/virfinder.txt, /hps/nobackup2/production/metagenomics/mhoelzer/nextflow-work-mhoelzer/88/63175e2b934b78d008e67446792c3e/PPRmeta.txt]
Only r_plot did not work, but maybe that's because the MARVEL output is missing?
Error executing process > 'r_plot (1)'
Caused by:
Process `r_plot (1)` terminated with an error exit status (1)
Command executed:
convert.sh
heatmap.R summary.csv
Command exit status:
1
Command output:
(empty)
Command error:
sort: cannot create temporary file in '/scratch': No such file or directory
sort: cannot create temporary file in '/scratch': No such file or directory
sort: cannot create temporary file in '/scratch': No such file or directory
Attaching package: ‘dplyr’
The following object is masked from ‘package:gridExtra’:
combine
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘tidyr’
The following object is masked from ‘package:reshape2’:
smiths
Error in FUN(X[[i]], ...) : object 'variable' not found
Calls: <Anonymous> ... <Anonymous> -> f -> scales_add_defaults -> lapply -> FUN
Execution halted
I just had a rough look at the VirSorter output: 220 phage sequences and mostly cat-2.
@r_plot error: nope, the problem is not that the MARVEL output is missing; I tested it.
Yes, it is the missing MARVEL output. It would be good to know whether it's the contig size or the amount, because the amount would not be an issue: raw reads will get a separate file "entry" (--fastq) plus a samtools and fastq-to-fasta conversion. But if it's the size, I'll need to adjust MARVEL accordingly.
@missing MARVEL output: ahh ok, because your R script is not dynamic with respect to its input files? OK.
@contig size or amount: yeah, we have to investigate this. Can't say at the moment without further testing.
@missing: I tried it as a groupTuple() to avoid an explicit input channel (my scripts don't care, for instance; the files only have to be present), but somehow I got strange errors with Nextflow. So it's more a channel-handling issue :D
If it's based on the "amount of input", it should be fixed in a99a759 with the --fastq input.
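For reference, the fastq-to-fasta conversion behind such a --fastq entry can look like this generic awk sketch (file names are placeholders, not the exact WtP command; assumes standard 4-line FASTQ records):

# Generic FASTQ -> FASTA conversion
awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: replace leading @ with >
     NR % 4 == 2 { print }                     # sequence line
' reads.fastq > reads.fasta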
Reopen if this error happens again.
I just tried this input file:
https://github.com/hoelzer/CWL_viral_pipeline/tree/master/CWL/Files_for_test
and ran into this error again:
Error executing process > 'marvel_wf:marvel (1)'
Caused by:
Failed to parse template script (your template may contain an error or be trying to use expressions not currently supported): org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/groovy/script/ScriptE9E90C6982D77ECE92405BEF7E1F77D3: 1: String too long. The given string is 82394 Unicode code units long, but only a maximum of 65535 is allowed.
@ line 1, column 16.
__$$_out.print("""
^
1 error
Source block:
"""
rnd=${Math.random()}
mkdir fasta_dir_${name}
cp ${fasta} fasta_dir_${name}/
# Marvel
marvel_bins.py -i fasta_dir_${name} -t ${params.cpus} > results.txt
# getting contig names
filenames=\$(grep "${name}\\." results.txt | cut -f2 -d " ")
while IFS= read -r samplename ; do
head -1 fasta_dir_${name}/\${samplename}.fa >> ${name}_\${rnd//0.}.list
done < <(printf '%s\n' "\${filenames}")
"""
There are indeed some long contigs:
(base) ➜ What_the_Phage git:(master) grep ">" /homes/mhoelzer/backuped/git/CWL_viral_pipeline/CWL/Files_for_test/ERR575691_host_filtered.fasta | head
>NODE_1_length_576855_cov_17.155733
>NODE_2_length_264378_cov_11.585458
>NODE_3_length_109381_cov_8.571337
>NODE_4_length_86514_cov_9.751212
>NODE_5_length_65088_cov_10.862885
>NODE_6_length_61270_cov_3.963653
>NODE_7_length_53681_cov_7.396207
>NODE_8_length_46689_cov_12.914976
>NODE_9_length_44408_cov_5.457669
>NODE_10_length_41715_cov_27633.072156
As a test, I removed all sequences larger than 100k nt: still the error.
Then I removed all sequences larger than 50k nt. Still the error.
Then I removed a bunch of short (< ~1.2k nt) sequences, and then it was running.
Next, I used the original file again (with the very long sequences) and just removed all contigs smaller than 500 nt. And it was working. I don't know what the minimum length is that MARVEL accepts, but it seems an initial length-filter step should be included in the pipeline.
I just tested removing everything smaller than 100 nt: the error occurs again. Smaller than 150: error. Smaller than 200: error. Smaller than 300: it's running.
By removing all contigs smaller than 300 nt, I reduced the number of contigs from 2593 to 1151.
As another test, I removed the longest contigs until I was left with a file of only 1578 contigs (including all the short ones), and got some other weird error.
I would suggest implementing a length filter (at least for MARVEL) with a cutoff of 300 nt. What do you think?
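Something like this minimal GNU awk sketch would do it (the 300 nt cutoff and the file names are just examples; a dedicated tool like seqkit could probably do the same):

# Hypothetical pre-filter: keep only contigs of at least 300 nt (GNU awk)
awk 'BEGIN { RS = ">"; ORS = "" }
     NR > 1 {
         seq = $0
         sub(/^[^\n]*\n/, "", seq)   # drop the header line
         gsub(/\n/, "", seq)         # join wrapped sequence lines
         if (length(seq) >= 300) printf(">%s", $0)   # re-print the kept record
     }' input.fasta > filtered.fasta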
I think I identified the error and what it is talking about, @hoelzer.
grep ">" ERR575691_host_filtered.fasta | wc -c
gives me 89773.
So it's talking about the total size of all contig headers. It has nothing to do with the amount (only indirectly: more read names means more characters) or with the sequences themselves.
Error Message:
String too long. The given string is 82394 Unicode code units long
It is not a MARVEL error, btw; it's an error from the .splitFasta() statement. The solution would be to use a process instead of the filter.
The workaround would be to break your file into 2 parts and then put them into the workflow, until I rewrite that part. If splitting the file into 2 parts works, tell me, because then it's most definitely this issue.
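A minimal shell sketch of that two-part workaround could look like this (file names are placeholders):

# Hypothetical workaround: split a FASTA into two halves at the record level
total=$(grep -c '^>' input.fasta)
awk -v half=$(( (total + 1) / 2 )) '
    /^>/ { n++ }
    { print > (n <= half ? "part1.fasta" : "part2.fasta") }
' input.fasta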
It is because MARVEL wants each contig as a separate file, to work like a tool that analyzes "bins" (a folder with lots of FASTA files). So, out of a FASTA or FASTQ file, we create one file per contig/read and simulate a bin directory. Then we know the information per contig, instead of just "this FASTA contains a phage".
Or at least we wrote it this way to be more useful.
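So the splitting step essentially does something like the following minimal sketch (illustrative names, not the actual WtP code; assumes a well-formed FASTA that starts with a header line):

# Simulate a MARVEL "bin" directory: one FASTA file per contig/read
mkdir -p fasta_dir
awk '/^>/ { close(out); out = sprintf("fasta_dir/contig_%d.fa", ++i) }
     { print > out }' input.fasta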
Oh snap. Ok, I see. I will test splitting the file into two chunks and report.
@replikation ok, this is a bit odd and I am not sure what is going on.
I split the file into smaller chunks of 500 sequences each, using this nice awk command:
awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%500==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < ERR575691_host_filtered.fasta
Please find here one of the chunked fasta files: myseq1500.fa.zip
That now gives me this error:
Error executing process > 'marvel_wf:marvel (1)'
Caused by:
Process `marvel_wf:marvel (1)` terminated with an error exit status (1)
Command executed:
rnd=0.6634756290061703
mkdir fasta_dir_myseq1500
cp myseq1500.1.fa myseq1500.2.fa myseq1500.3.fa myseq1500.4.fa myseq1500.5.fa myseq1500.6.fa myseq1500.7.fa myseq1500.8.fa myseq1500.9.fa myseq1500.10.fa myseq1500.11.fa myseq1500.12.fa myseq1500.13.fa myseq1500.14.fa myseq1500.15.fa myseq1500.16.fa myseq1500.17.fa myseq1500.18.fa myseq1500.19.fa myseq1500.20.fa myseq1500.21.fa myseq1500.22.fa myseq1500.23.fa myseq1500.24.fa myseq1500.25.fa myseq1500.26.fa myseq1500.27.fa myseq1500.28.fa myseq1500.29.fa myseq1500.30.fa myseq1500.31.fa myseq1500.32.fa myseq1500.33.fa myseq1500.34.fa myseq1500.35.fa myseq1500.36.fa myseq1500.37.fa myseq1500.38.fa myseq1500.39.fa myseq1500.40.fa myseq1500.41.fa myseq1500.42.fa myseq1500.43.fa myseq1500.44.fa myseq1500.45.fa myseq1500.46.fa myseq1500.47.fa myseq1500.48.fa myseq1500.49.fa myseq1500.50.fa myseq1500.51.fa myseq1500.52.fa myseq1500.53.fa myseq1500.54.fa myseq1500.55.fa myseq1500.56.fa myseq1500.57.fa myseq1500.58.fa myseq1500.59.fa myseq1500.60.fa myseq1500.61.fa myseq1500.62.fa myseq1500.63.fa myseq1500.64.fa myseq1500.65.fa myseq1500.66.fa myseq1500.67.fa myseq1500.68.fa myseq1500.69.fa myseq1500.70.fa myseq1500.71.fa myseq1500.72.fa myseq1500.73.fa myseq1500.74.fa myseq1500.75.fa myseq1500.76.fa myseq1500.77.fa myseq1500.78.fa myseq1500.79.fa myseq1500.80.fa myseq1500.81.fa myseq1500.82.fa myseq1500.83.fa myseq1500.84.fa myseq1500.85.fa myseq1500.86.fa myseq1500.87.fa myseq1500.88.fa myseq1500.89.fa myseq1500.90.fa myseq1500.91.fa myseq1500.92.fa myseq1500.93.fa myseq1500.94.fa myseq1500.95.fa myseq1500.96.fa myseq1500.97.fa myseq1500.98.fa myseq1500.99.fa myseq1500.100.fa myseq1500.101.fa myseq1500.102.fa myseq1500.103.fa myseq1500.104.fa myseq1500.105.fa myseq1500.106.fa myseq1500.107.fa myseq1500.108.fa myseq1500.109.fa myseq1500.110.fa myseq1500.111.fa myseq1500.112.fa myseq1500.113.fa myseq1500.114.fa myseq1500.115.fa myseq1500.116.fa myseq1500.117.fa myseq1500.118.fa myseq1500.119.fa myseq1500.120.fa myseq1500.121.fa myseq1500.122.fa myseq1500.123.fa myseq1500.124.fa myseq1500.125.fa myseq1500.126.fa myseq1500.127.fa myseq1500.128.fa myseq1500.129.fa myseq1500.130.fa myseq1500.131.fa myseq1500.132.fa myseq1500.133.fa myseq1500.134.fa myseq1500.135.fa myseq1500.136.fa myseq1500.137.fa myseq1500.138.fa myseq1500.139.fa myseq1500.140.fa myseq1500.141.fa myseq1500.142.fa myseq1500.143.fa myseq1500.144.fa myseq1500.145.fa myseq1500.146.fa myseq1500.147.fa myseq1500.148.fa myseq1500.149.fa myseq1500.150.fa myseq1500.151.fa myseq1500.152.fa myseq1500.153.fa myseq1500.154.fa myseq1500.155.fa myseq1500.156.fa myseq1500.157.fa myseq1500.158.fa myseq1500.159.fa myseq1500.160.fa myseq1500.161.fa myseq1500.162.fa myseq1500.163.fa myseq1500.164.fa myseq1500.165.fa myseq1500.166.fa myseq1500.167.fa myseq1500.168.fa myseq1500.169.fa myseq1500.170.fa myseq1500.171.fa myseq1500.172.fa myseq1500.173.fa myseq1500.174.fa myseq1500.175.fa myseq1500.176.fa myseq1500.177.fa myseq1500.178.fa myseq1500.179.fa myseq1500.180.fa myseq1500.181.fa myseq1500.182.fa myseq1500.183.fa myseq1500.184.fa myseq1500.185.fa myseq1500.186.fa myseq1500.187.fa myseq1500.188.fa myseq1500.189.fa myseq1500.190.fa myseq1500.191.fa myseq1500.192.fa myseq1500.193.fa myseq1500.194.fa myseq1500.195.fa myseq1500.196.fa myseq1500.197.fa myseq1500.198.fa myseq1500.199.fa myseq1500.200.fa myseq1500.201.fa myseq1500.202.fa myseq1500.203.fa myseq1500.204.fa myseq1500.205.fa myseq1500.206.fa myseq1500.207.fa myseq1500.208.fa myseq1500.209.fa myseq1500.210.fa myseq1500.211.fa myseq1500.212.fa myseq1500.213.fa myseq1500.214.fa myseq1500.215.fa 
myseq1500.216.fa myseq1500.217.fa myseq1500.218.fa myseq1500.219.fa myseq1500.220.fa myseq1500.221.fa myseq1500.222.fa myseq1500.223.fa myseq1500.224.fa myseq1500.225.fa myseq1500.226.fa myseq1500.227.fa myseq1500.228.fa myseq1500.229.fa myseq1500.230.fa myseq1500.231.fa myseq1500.232.fa myseq1500.233.fa myseq1500.234.fa myseq1500.235.fa myseq1500.236.fa myseq1500.237.fa myseq1500.238.fa myseq1500.239.fa myseq1500.240.fa myseq1500.241.fa myseq1500.242.fa myseq1500.243.fa myseq1500.244.fa myseq1500.245.fa myseq1500.246.fa myseq1500.247.fa myseq1500.248.fa myseq1500.249.fa myseq1500.250.fa myseq1500.251.fa myseq1500.252.fa myseq1500.253.fa myseq1500.254.fa myseq1500.255.fa myseq1500.256.fa myseq1500.257.fa myseq1500.258.fa myseq1500.259.fa myseq1500.260.fa myseq1500.261.fa myseq1500.262.fa myseq1500.263.fa myseq1500.264.fa myseq1500.265.fa myseq1500.266.fa myseq1500.267.fa myseq1500.268.fa myseq1500.269.fa myseq1500.270.fa myseq1500.271.fa myseq1500.272.fa myseq1500.273.fa myseq1500.274.fa myseq1500.275.fa myseq1500.276.fa myseq1500.277.fa myseq1500.278.fa myseq1500.279.fa myseq1500.280.fa myseq1500.281.fa myseq1500.282.fa myseq1500.283.fa myseq1500.284.fa myseq1500.285.fa myseq1500.286.fa myseq1500.287.fa myseq1500.288.fa myseq1500.289.fa myseq1500.290.fa myseq1500.291.fa myseq1500.292.fa myseq1500.293.fa myseq1500.294.fa myseq1500.295.fa myseq1500.296.fa myseq1500.297.fa myseq1500.298.fa myseq1500.299.fa myseq1500.300.fa myseq1500.301.fa myseq1500.302.fa myseq1500.303.fa myseq1500.304.fa myseq1500.305.fa myseq1500.306.fa myseq1500.307.fa myseq1500.308.fa myseq1500.309.fa myseq1500.310.fa myseq1500.311.fa myseq1500.312.fa myseq1500.313.fa myseq1500.314.fa myseq1500.315.fa myseq1500.316.fa myseq1500.317.fa myseq1500.318.fa myseq1500.319.fa myseq1500.320.fa myseq1500.321.fa myseq1500.322.fa myseq1500.323.fa myseq1500.324.fa myseq1500.325.fa myseq1500.326.fa myseq1500.327.fa myseq1500.328.fa myseq1500.329.fa myseq1500.330.fa myseq1500.331.fa myseq1500.332.fa myseq1500.333.fa myseq1500.334.fa myseq1500.335.fa myseq1500.336.fa myseq1500.337.fa myseq1500.338.fa myseq1500.339.fa myseq1500.340.fa myseq1500.341.fa myseq1500.342.fa myseq1500.343.fa myseq1500.344.fa myseq1500.345.fa myseq1500.346.fa myseq1500.347.fa myseq1500.348.fa myseq1500.349.fa myseq1500.350.fa myseq1500.351.fa myseq1500.352.fa myseq1500.353.fa myseq1500.354.fa myseq1500.355.fa myseq1500.356.fa myseq1500.357.fa myseq1500.358.fa myseq1500.359.fa myseq1500.360.fa myseq1500.361.fa myseq1500.362.fa myseq1500.363.fa myseq1500.364.fa myseq1500.365.fa myseq1500.366.fa myseq1500.367.fa myseq1500.368.fa myseq1500.369.fa myseq1500.370.fa myseq1500.371.fa myseq1500.372.fa myseq1500.373.fa myseq1500.374.fa myseq1500.375.fa myseq1500.376.fa myseq1500.377.fa myseq1500.378.fa myseq1500.379.fa myseq1500.380.fa myseq1500.381.fa myseq1500.382.fa myseq1500.383.fa myseq1500.384.fa myseq1500.385.fa myseq1500.386.fa myseq1500.387.fa myseq1500.388.fa myseq1500.389.fa myseq1500.390.fa myseq1500.391.fa myseq1500.392.fa myseq1500.393.fa myseq1500.394.fa myseq1500.395.fa myseq1500.396.fa myseq1500.397.fa myseq1500.398.fa myseq1500.399.fa myseq1500.400.fa myseq1500.401.fa myseq1500.402.fa myseq1500.403.fa myseq1500.404.fa myseq1500.405.fa myseq1500.406.fa myseq1500.407.fa myseq1500.408.fa myseq1500.409.fa myseq1500.410.fa myseq1500.411.fa myseq1500.412.fa myseq1500.413.fa myseq1500.414.fa myseq1500.415.fa myseq1500.416.fa myseq1500.417.fa myseq1500.418.fa myseq1500.419.fa myseq1500.420.fa myseq1500.421.fa myseq1500.422.fa myseq1500.423.fa myseq1500.424.fa 
myseq1500.425.fa myseq1500.426.fa myseq1500.427.fa myseq1500.428.fa myseq1500.429.fa myseq1500.430.fa myseq1500.431.fa myseq1500.432.fa myseq1500.433.fa myseq1500.434.fa myseq1500.435.fa myseq1500.436.fa myseq1500.437.fa myseq1500.438.fa myseq1500.439.fa myseq1500.440.fa myseq1500.441.fa myseq1500.442.fa myseq1500.443.fa myseq1500.444.fa myseq1500.445.fa myseq1500.446.fa myseq1500.447.fa myseq1500.448.fa myseq1500.449.fa myseq1500.450.fa myseq1500.451.fa myseq1500.452.fa myseq1500.453.fa myseq1500.454.fa myseq1500.455.fa myseq1500.456.fa myseq1500.457.fa myseq1500.458.fa myseq1500.459.fa myseq1500.460.fa myseq1500.461.fa myseq1500.462.fa myseq1500.463.fa myseq1500.464.fa myseq1500.465.fa myseq1500.466.fa myseq1500.467.fa myseq1500.468.fa myseq1500.469.fa myseq1500.470.fa myseq1500.471.fa myseq1500.472.fa myseq1500.473.fa myseq1500.474.fa myseq1500.475.fa myseq1500.476.fa myseq1500.477.fa myseq1500.478.fa myseq1500.479.fa myseq1500.480.fa myseq1500.481.fa myseq1500.482.fa myseq1500.483.fa myseq1500.484.fa myseq1500.485.fa myseq1500.486.fa myseq1500.487.fa myseq1500.488.fa myseq1500.489.fa myseq1500.490.fa myseq1500.491.fa myseq1500.492.fa myseq1500.493.fa myseq1500.494.fa myseq1500.495.fa myseq1500.496.fa myseq1500.497.fa myseq1500.498.fa myseq1500.499.fa myseq1500.500.fa fasta_dir_myseq1500/
# Marvel
marvel_bins.py -i fasta_dir_myseq1500 -t 8 > results.txt
# getting contig names
filenames=$(grep "myseq1500\." results.txt | cut -f2 -d " ")
while IFS= read -r samplename ; do
head -1 fasta_dir_myseq1500/${samplename}.fa >> myseq1500_${rnd//0.}.list
done < <(printf '%s
' "${filenames}")
Command exit status:
1
Command output:
(empty)
Command error:
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Traceback (most recent call last):
File "/MARVEL/marvel_bins.py", line 317, in <module>
y_test_predicted_pickle = pickle_model.predict(array_bins[:, ])
File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 545, in predict
proba = self.predict_proba(X)
File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 588, in predict_proba
X = self._validate_X_predict(X)
File "/usr/local/lib/python3.6/dist-packages/sklearn/ensemble/forest.py", line 359, in _validate_X_predict
return self.estimators_[0]._validate_X_predict(X, check_input=True)
File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 391, in _validate_X_predict
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 521, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Btw, same issue on my local machine (MacBook).
I think this is related to the way the MARVEL module is implemented. For larger files (~10k sequences) I get errors like this (on the LSF cluster and on my local machine):
Error executing process > 'marvel_wf:marvel (3)'
Caused by:
Process `marvel_wf:marvel (3)` terminated with an error exit status (1)
Command executed:
rnd=0.2659904844845017
mkdir fasta_dir_ERR575691_host_filtered_filt500bp
cp ERR575691_host_filtered_filt500bp.1.fa ERR575691_host_filtered_filt500bp.2.fa ERR575691_host_filtered_filt500bp.3.fa ERR575691_host_filtered_filt500bp.4.fa ERR575691_host_filtered_filt500bp.5.fa ERR575691_host_filtered_filt500bp.6.fa ERR575691_host_filtered_filt500bp.7.fa ERR575691_host_filtered_filt500bp.8.fa ERR575691_host_filtered_filt500bp.9.fa ERR575691_host_filtered_filt500bp.10.fa ERR575691_host_filtered_filt500bp.11.fa ERR575691_host_filtered_filt500bp.12.fa ERR575691_host_filtered_filt500bp.13.fa ERR575691_host_filtered_filt500bp.14.fa ERR575691_host_filtered_filt500bp.15.fa ERR575691_host_filtered_filt500bp.16.fa ERR575691_host_filtered_filt500bp.17.fa ERR575691_host_filtered_filt500bp.18.fa ERR575691_host_filtered_filt500bp.19.fa ERR575691_host_filtered_filt500bp.20.fa ERR575691_host_filtered_filt500bp.21.fa ERR575691_host_filtered_filt500bp.22.fa ERR575691_host_filtered_filt500bp.23.fa ... ERR575691_host_filtered_filt500bp.491.fa ERR575691_host_filtered_filt500bp.492.fa ERR575691_host_filtered_filt500bp.493.fa ERR575691_host_filtered_filt500bp.494.fa ERR575691_host_filtered_filt500bp.495.fa ERR575691_host_filtered_filt500bp.496.fa ERR575691_host_filtered_filt500bp.497.fa ERR575691_host_filtered_filt500bp.498.fa ERR575691_host_filtered_filt500bp.499.fa ERR575691_host_filtered_filt500bp.500.fa ERR575691_host_filtered_filt500bp.501.fa ERR575691_host_filtered_filt500bp.502.fa ERR575691_host_filtered_filt500bp.503.fa fasta_dir_ERR575691_host_filtered_filt500bp/
# Marvel
marvel_bins.py -i fasta_dir_ERR575691_host_filtered_filt500bp -t 8 > results.txt
# getting contig names
filenames=$(grep "ERR575691_host_filtered_filt500bp\." results.txt | cut -f2 -d " ")
while IFS= read -r samplename ; do
head -1 fasta_dir_ERR575691_host_filtered_filt500bp/${samplename}.fa >> ERR575691_host_filtered_filt500bp_${rnd//0.}.list
done < <(printf '%s
' "${filenames}")
Command exit status:
1
Command output:
(empty)
Command error:
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/base.py:306: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.19.1 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
head: cannot open 'fasta_dir_ERR575691_host_filtered_filt500bp/has.fa' for reading: No such file or directory
@hoelzer we identified the bug; it was just an issue with 100% negative files, and we fixed it. WtP now runs completely with files that don't contain any phages.
Nothing important for now, but I was curious and just threw some Nanopore reads into the workflow, and MARVEL reported an error (output not shown here).
So it seems a length filter also needs to be applied for MARVEL when it is used with long sequences. Are all the input test data sets actually smaller than this size? I mean, it is totally possible that someone just starts the workflow with a FASTA file containing longer sequences.
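If long sequences also break MARVEL, the same awk pattern as the minimum-length sketch above works with an upper bound (cutoff and file names are illustrative; MARVEL's actual maximum is unknown here):

# Hypothetical upper-bound filter: drop sequences longer than 100 kb (GNU awk)
awk 'BEGIN { RS = ">"; ORS = "" }
     NR > 1 {
         seq = $0
         sub(/^[^\n]*\n/, "", seq)   # drop the header line
         gsub(/\n/, "", seq)         # join wrapped sequence lines
         if (length(seq) <= 100000) printf(">%s", $0)   # re-print the kept record
     }' nanopore.fasta > capped.fasta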