JSSaini closed this issue 5 months ago
The HMM file was created from the metagenomic assemblies from which the MAGs were obtained with different binning algorithms.
Hi, your input file S21.contigs_to_bin_R_change2.tsv does not look tab separated at first sight. Could you please recheck if this might be the issue?
Rscript ../github/MAGScoT/MAGScoT.R -i S21.contigs_to_bin_R_change4.tsv --hmm S21.hmm
Loading packages...
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
Bisco iteration: 1
Exiting bin merging, no (more) overlaps found
Refining bins, iteration: 1
Extracting SCG information for bins...
Scores for all initial bins written to: MAGScoT.scores.out
The highest scoring bin is: METABAT__113-contigs.fa
Remaining: 1029 contigs and 4 candidate bins
The highest scoring bin is: METABAT__101-contigs.fa
Remaining: 1029 contigs and 3 candidate bins
The highest scoring bin is: METABAT__112-contigs.fa
Remaining: 1029 contigs and 2 candidate bins
The highest scoring bin is: METABAT__104-contigs.fa
Remaining: 1029 contigs and 1 candidate bins
Refinement lead to a total of 4 bins with a score >= 0.5
Refinement stats are written to: MAGScoT.refined.out
Contig-to-refined-bin mapping is written to: MAGScoT.refined.contig_to_bin.out
Thank you for your reply. I corrected the file, and the run now finishes, but with the warning above. Are these the expected output files? Also, how do I make .fa files from this output?
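One possible way to get per-bin FASTA files from `MAGScoT.refined.contig_to_bin.out` is to split the assembly by that mapping with plain awk. This is only a sketch: the function name `split_fasta_by_bin` and the demo data are made up, and it assumes the mapping has one header line followed by whitespace-separated bin ID (column 1) and contig ID (column 2). Output files are appended to, so the output directory should start empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one FASTA file per bin.
# Usage: split_fasta_by_bin <mapping> <assembly.fasta> <outdir>
# Assumption: mapping = header line, then "bin<whitespace>contig" per line.
split_fasta_by_bin() {
  local map=$1 fasta=$2 outdir=$3
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    NR == FNR { if (FNR > 1) bin[$2] = $1; next }   # first file: contig -> bin
    /^>/ {                                          # FASTA header: pick target file
      id  = substr($1, 2)
      out = (id in bin) ? outdir "/" bin[id] ".fa" : ""
    }
    out != "" { print >> out }                      # copy header and sequence lines
  ' "$map" "$fasta"
}

# Demo with made-up data: c4 is not assigned to any bin and is skipped.
work=$(mktemp -d)
printf 'bin\tcontig\nMAGScoT_cleanbin_000001\tc1\nMAGScoT_cleanbin_000001\tc3\nMAGScoT_cleanbin_000002\tc2\n' > "$work/map.tsv"
printf '>c1 desc\nACGT\n>c2\nGG\n>c3\nTT\n>c4\nAA\n' > "$work/assembly.fasta"
split_fasta_by_bin "$work/map.tsv" "$work/assembly.fasta" "$work/bins"
ls "$work/bins"
```

Contigs that are absent from the mapping are dropped silently, so comparing the number of headers written with the number of mapping lines is a cheap sanity check.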
Hi, I used the following script and obtained only four clean bins.
cat MAGScoT.refined.contig_to_bin.out | awk '{if(NR==1){print "contig_id,cluster_id"; next}; print $2","$1}' | sed 's/[.]fasta//' | extract_fasta_bins.py S21_contigs_anvio_filter2500.fasta /dev/stdin --output_path cleanbins
(anvio-8) [saini@jed cleanbins]$ ls -lht
total 13M
-rw-r--r-- 1 saini lbe-unit 2.3M May 9 18:59 MAGScoT_cleanbin_000004.fa
-rw-r--r-- 1 saini lbe-unit 2.3M May 9 18:59 MAGScoT_cleanbin_000003.fa
-rw-r--r-- 1 saini lbe-unit 4.9M May 9 18:59 MAGScoT_cleanbin_000002.fa
-rw-r--r-- 1 saini lbe-unit 3.1M May 9 18:59 MAGScoT_cleanbin_000001.fa
This does not seem normal for a 326.75 MB assembly.
stats.sh S21_contigs_anvio_filter2500.fasta
A C G T N IUPAC Other GC GC_stdev
0.2362 0.2632 0.2633 0.2373 0.0000 0.0000 0.0000 0.5265 0.1208
Main genome scaffold total: 50443
Main genome contig total: 50443
Main genome scaffold sequence total: 326.755 MB
Main genome contig sequence total: 326.755 MB 0.000% gap
Main genome scaffold N/L50: 8999/7.617 KB
Main genome contig N/L50: 8999/7.617 KB
Main genome scaffold N/L90: 38467/2.992 KB
Main genome contig N/L90: 38467/2.992 KB
Max scaffold length: 1.279 MB
Max contig length: 1.279 MB
Number of scaffolds > 50 KB: 354
% main genome in scaffolds > 50 KB: 11.52%
Minimum   Number      Number      Total          Total          Scaffold
Scaffold  of          of          Scaffold       Contig         Contig
Length    Scaffolds   Contigs     Length         Length         Coverage
--------  ----------  ----------  -------------  -------------  --------
All       50,443      50,443      326,754,663    326,754,663    100.00%
1 KB      50,443      50,443      326,754,663    326,754,663    100.00%
2.5 KB    50,443      50,443      326,754,663    326,754,663    100.00%
5 KB      17,409      17,409      214,494,902    214,494,902    100.00%
10 KB     5,834       5,834       135,913,449    135,913,449    100.00%
25 KB     1,300       1,300       69,090,113     69,090,113     100.00%
50 KB     354         354         37,648,997     37,648,997     100.00%
100 KB    105         105         20,855,227     20,855,227     100.00%
250 KB    20          20          8,814,461      8,814,461      100.00%
500 KB    5           5           3,826,317      3,826,317      100.00%
1 MB      1           1           1,278,994      1,278,994      100.00%
I have attached the results along with the input files. Kindly have a look.
Hi, your inputs still look messy. Your contigs_to_bin file contains a varying number of delimiters per line, where there should be exactly three columns, see here:
METABAT__95-contigs.fa,,S21_c_000000027719,Metabat
METABAT__95-contigs.fa,,S21_c_000000020613,Metabat
SemiBin_0.fa,,,,,,,S21_c_000000000001,SemiBin
SemiBin_0.fa,,,,,,,S21_c_000000000002,SemiBin
I would assume that this might also cause your processing steps to fail for those and that is why you get so few bins as a result. So please make sure that your input files are formatted correctly.
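A quick way to spot this kind of problem before running MAGScoT is to count the tab-separated fields on every line. The sketch below is illustrative only: the function name `check_contigs_to_bin` and the demo file are hypothetical, and it assumes the three-column layout (bin, contig, binner) suggested by the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every line that does not have exactly three
# non-empty, tab-separated fields. Returns non-zero if any line is bad.
check_contigs_to_bin() {
  awk -F'\t' '
    NF != 3 || $1 == "" || $2 == "" || $3 == "" {
      printf "line %d: %d field(s): %s\n", NR, NF, $0
      bad++
    }
    END { exit (bad > 0) }
  ' "$1"
}

# Demo on a small made-up file: line 1 is valid, line 2 is comma-separated.
demo=$(mktemp)
printf 'METABAT__95-contigs.fa\tS21_c_000000027719\tMetabat\n' > "$demo"
printf 'SemiBin_0.fa,,,,,,,S21_c_000000000001,SemiBin\n' >> "$demo"
check_contigs_to_bin "$demo" || echo "input needs fixing"
```

Because a comma-separated line contains no tabs, it shows up as a single field, which makes the formatting problem easy to see.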
Hi!
@eikematthias is absolutely correct, there is some misformatting in the input TSV file. I used the version in the provided tar file to create a working version:
sed -r 's/[,]+/\t/g' S21.contigs_to_bin_R_change3.tsv > S21.contigs_to_bin_R_change4.tsv
In addition, bin IDs must be unique across binners; otherwise MAGScoT lumps bins with the same name together, irrespective of the source (this will be addressed in future versions). In your file, this is not the case for the two different SemiBin2 runs. I appended "_TM" to the bin names in column 1 for all bins from SemiBin_TM:
awk '{if(length($3)>8){print $1"_TM\t"$2"\t"$3}else{print}}' S21.contigs_to_bin_R_change4.tsv > S21.contigs_to_bin_R_change5.tsv
S21.contigs_to_bin_R_change5.tsv.zip
Using this as input, MAGScoT runs through and creates 61 bins with scores > 0.5 using MAGScoT's default scoring:
Rscript MAGScoT.R -i S21.contigs_to_bin_R_change5.tsv --hmm S21.hmm -o S21_new2 > S21_new2.log
I attached all outputs to this message: S21.contigs_to_bin_R_change5.tsv.zip
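The uniqueness requirement for bin IDs can also be checked before running MAGScoT. The following is a minimal sketch, assuming a tab-separated table with the bin ID in column 1 and the binner label in column 3; the function name `find_shared_bin_ids` and the contig names in the demo are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list bin IDs (column 1) that occur with more than
# one binner label (column 3) in a tab-separated bin/contig/binner table.
find_shared_bin_ids() {
  awk -F'\t' '
    !seen[$1 FS $3]++ { n[$1]++ }               # distinct binners per bin ID
    END { for (b in n) if (n[b] > 1) print b }
  ' "$1" | sort
}

# Demo with made-up contigs: SemiBin_0.fa is used by two binner labels.
demo=$(mktemp)
printf 'SemiBin_0.fa\tc1\tSemiBin\n'            > "$demo"
printf 'SemiBin_0.fa\tc2\tSemiBin_TM\n'        >> "$demo"
printf 'METABAT__95-contigs.fa\tc3\tMetabat\n' >> "$demo"
find_shared_bin_ids "$demo"
```

An empty result means that no bin ID is shared between binners; any ID that is printed needs a distinguishing suffix or prefix, as in the "_TM" fix above.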
@mruehlemann @eikematthias Indeed, I see the errors I made while creating the tab-delimited file. Thank you for helping with the troubleshooting and providing the correct steps. Perhaps it would be useful to add a section to the documentation explaining how users can prepare a MAGScoT-compatible tab-delimited file once they have FASTA files from multiple binning outputs. I will close this thread. Thank you. Kind regards, Jaspreet
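As a rough illustration of what such a preparation step could look like, the sketch below builds the table directly from directories of bin FASTA files, one directory per binner. It is not taken from the MAGScoT documentation: the function name `bins_to_table`, the directory layout, and the `.fa` extension are assumptions, as is the column order (bin, contig, binner) without a header row, which follows the examples in this thread and should be checked against the MAGScoT README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print bin<TAB>contig<TAB>binner for every FASTA
# header in every *.fa file of one binner's output directory.
# Usage: bins_to_table <bin_dir> <binner_label>
bins_to_table() {
  local dir=$1 binner=$2 f
  for f in "$dir"/*.fa; do
    [ -e "$f" ] || continue
    # Prefix the bin name with the binner label so IDs stay unique.
    awk -v bin="${binner}_$(basename "$f" .fa)" -v binner="$binner" '
      /^>/ { sub(/^>/, "", $1); print bin "\t" $1 "\t" binner }
    ' "$f"
  done
}

# Demo with two made-up binner directories that both contain bin1.fa.
work=$(mktemp -d)
mkdir "$work/metabat" "$work/semibin"
printf '>c1 len=10\nACGT\n>c2\nGGCC\n' > "$work/metabat/bin1.fa"
printf '>c3\nTTAA\n' > "$work/semibin/bin1.fa"
{ bins_to_table "$work/metabat" Metabat
  bins_to_table "$work/semibin" SemiBin; } > "$work/contigs_to_bin.tsv"
cat "$work/contigs_to_bin.tsv"
```

Writing the tabs with awk rather than assembling the file in a spreadsheet or text editor avoids the stray delimiters that caused the problem here, and the binner prefix sidesteps the duplicate bin ID issue.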
Hi, I am running the provided MAGScoT R script and I am getting the following error at line 3.
Rscript ../github/MAGScoT/MAGScoT.R -i S21.contigs_to_bin_R_change2.tsv --hmm S21.hmm
This is what my input files look like:
I have attached both input files in case you would like to try them on your end.
MAGScoT_input_files.zip
Thank you for your assistance.