dplyr error in R script

JSSaini commented 5 months ago

Hi, I am running the provided MAGScoT R script and I am getting the following error at line 3.

Rscript ../github/MAGScoT/MAGScoT.R -i S21.contigs_to_bin_R_change2.tsv --hmm S21.hmm

Loading packages...
Error in `filter()`:
ℹ In argument: `!is.na(set)`.
Caused by error:
! object 'set' not found
Backtrace:
     ▆
  1. ├─contig_to_bin %>% filter(!is.na(set))
  2. ├─dplyr::filter(., !is.na(set))
  3. ├─dplyr:::filter.data.frame(., !is.na(set))
  4. │ └─dplyr:::filter_rows(.data, dots, by)
  5. │   └─dplyr:::filter_eval(...)
  6. │     ├─base::withCallingHandlers(...)
  7. │     └─mask$eval_all_filter(dots, env_filter)
  8. │       └─dplyr (local) eval()
  9. └─base::.handleSimpleError(`<fn>`, "object 'set' not found", base::quote(NULL))
 10.   └─dplyr (local) h(simpleError(msg, call))
 11.     └─rlang::abort(message, class = error_class, parent = parent, call = error_call)
Execution halted

This is what my input files look like:

 -i S21.contigs_to_bin_R_change2.tsv
METABAT__101-contigs.fa  S21_c_000000000110  Metabat
METABAT__101-contigs.fa  S21_c_000000014298  Metabat
METABAT__101-contigs.fa  S21_c_000000026485  Metabat
METABAT__101-contigs.fa  S21_c_000000034973  Metabat
METABAT__101-contigs.fa  S21_c_000000000360  Metabat

--hmm S21.hmm
S21_c_000000049715_2    PF00380.20      6.9e-47
S21_c_000000010177_4    PF00380.20      2e-46
S21_c_000000037432_3    PF00380.20      2.2e-46
S21_c_000000000100_50   PF00380.20      7.2e-46
S21_c_000000002671_15   PF00380.20      1.3e-45

I have attached both of my input files in case you would like to try them on your end.

MAGScoT_input_files.zip

Thank you for your assistance.

JSSaini commented 5 months ago

The HMM file was created from the metagenomic assemblies from which the MAGs were obtained with different binning algorithms.

eikematthias commented 5 months ago

Hi, at first sight your input file S21.contigs_to_bin_R_change2.tsv does not look tab-separated. Could you please check whether this might be the issue?
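One way to check this is to count the tab-separated fields on every line. The following is only a minimal sketch: the function name check_three_columns is hypothetical, and it assumes the file has no header line and should contain exactly three columns (bin, contig, binner), as in the excerpt above.

```shell
# Hypothetical helper (not part of MAGScoT): report every line that does not
# have exactly three tab-separated fields. Returns non-zero if any are found.
check_three_columns() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF != 3 { printf "line %d: %d field(s)\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Demo on a throwaway file: one clean line and one comma-delimited line.
demo=$(mktemp)
printf 'binA\tcontig_1\tMetabat\n' > "$demo"
printf 'binA,,contig_2,Metabat\n' >> "$demo"
check_three_columns "$demo" || echo "input is not a clean three-column TSV"
rm -f "$demo"
```

A line that was written with spaces or commas instead of tabs shows up as a single field, which makes this kind of misformatting easy to spot before running MAGScoT.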

JSSaini commented 5 months ago

Rscript ../github/MAGScoT/MAGScoT.R -i S21.contigs_to_bin_R_change4.tsv --hmm S21.hmm

Loading packages...
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat) 
Bisco iteration:  1 
Exiting bin merging, no (more) overlaps found
Refining bins, iteration: 1 
Extracting SCG information for bins...
Scores for all initial bins written to: MAGScoT.scores.out 
The highest scoring bin is: METABAT__113-contigs.fa 
Remaining: 1029 contigs and 4 candidate bins
The highest scoring bin is: METABAT__101-contigs.fa 
Remaining: 1029 contigs and 3 candidate bins
The highest scoring bin is: METABAT__112-contigs.fa 
Remaining: 1029 contigs and 2 candidate bins
The highest scoring bin is: METABAT__104-contigs.fa 
Remaining: 1029 contigs and 1 candidate bins
Refinement lead to a total of 4  bins with a score >= 0.5 
Refinement stats are written to: MAGScoT.refined.out 
Contig-to-refined-bin mapping is written to: MAGScoT.refined.contig_to_bin.out

Thank you for your reply. I corrected the file, and the run now finishes with the warning shown above. Are these the expected output files? Also, how can I create FASTA (.fa) files from this output?

JSSaini commented 5 months ago

Hi, I used the following script and obtained only four clean bins.

cat MAGScoT.refined.contig_to_bin.out | awk '{if(NR==1){print "contig_id,cluster_id"; next}; print $2","$1}' | sed 's/[.]fasta//' | extract_fasta_bins.py S21_contigs_anvio_filter2500.fasta /dev/stdin --output_path cleanbins

(anvio-8) [saini@jed cleanbins]$ ls -lht
total 13M
-rw-r--r-- 1 saini lbe-unit 2.3M May  9 18:59 MAGScoT_cleanbin_000004.fa
-rw-r--r-- 1 saini lbe-unit 2.3M May  9 18:59 MAGScoT_cleanbin_000003.fa
-rw-r--r-- 1 saini lbe-unit 4.9M May  9 18:59 MAGScoT_cleanbin_000002.fa
-rw-r--r-- 1 saini lbe-unit 3.1M May  9 18:59 MAGScoT_cleanbin_000001.fa

This does not seem normal for a 326.75 MB assembly.

stats.sh S21_contigs_anvio_filter2500.fasta
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.2362  0.2632  0.2633  0.2373  0.0000  0.0000  0.0000  0.5265  0.1208

Main genome scaffold total:             50443
Main genome contig total:               50443
Main genome scaffold sequence total:    326.755 MB
Main genome contig sequence total:      326.755 MB      0.000% gap
Main genome scaffold N/L50:             8999/7.617 KB
Main genome contig N/L50:               8999/7.617 KB
Main genome scaffold N/L90:             38467/2.992 KB
Main genome contig N/L90:               38467/2.992 KB
Max scaffold length:                    1.279 MB
Max contig length:                      1.279 MB
Number of scaffolds > 50 KB:            354
% main genome in scaffolds > 50 KB:     11.52%

Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All             50,443          50,443     326,754,663     326,754,663   100.00%
   1 KB             50,443          50,443     326,754,663     326,754,663   100.00%
 2.5 KB             50,443          50,443     326,754,663     326,754,663   100.00%
   5 KB             17,409          17,409     214,494,902     214,494,902   100.00%
  10 KB              5,834           5,834     135,913,449     135,913,449   100.00%
  25 KB              1,300           1,300      69,090,113      69,090,113   100.00%
  50 KB                354             354      37,648,997      37,648,997   100.00%
 100 KB                105             105      20,855,227      20,855,227   100.00%
 250 KB                 20              20       8,814,461       8,814,461   100.00%
 500 KB                  5               5       3,826,317       3,826,317   100.00%
   1 MB                  1               1       1,278,994       1,278,994   100.00%

JSSaini commented 5 months ago

MAGScot.results.tar.gz

I have attached the results along with the input files. Kindly have a look.

eikematthias commented 5 months ago

Hi, your inputs still look messy. Your contigs_to_bin file contains varying numbers of delimiters where there should be exactly three columns, see here:

METABAT__95-contigs.fa,,S21_c_000000027719,Metabat
METABAT__95-contigs.fa,,S21_c_000000020613,Metabat
SemiBin_0.fa,,,,,,,S21_c_000000000001,SemiBin
SemiBin_0.fa,,,,,,,S21_c_000000000002,SemiBin

I would assume that this might also cause your processing steps to fail for those and that is why you get so few bins as a result. So please make sure that your input files are formatted correctly.

mruehlemann commented 5 months ago

Hi!

@eikematthias is absolutely correct, there is some misformatting in the input TSV file. I used the version in the provided tar file to create a working version:

sed -r 's/[,]+/\t/g' S21.contigs_to_bin_R_change3.tsv > S21.contigs_to_bin_R_change4.tsv

In addition, it is important that bin IDs are unique across binners. In your case they are not for the two different SemiBin2 runs, and MAGScoT will lump bins with the same name together, irrespective of the source (this will be addressed in future versions). I appended the "_TM" suffix to the bin names in column 1 for all bins from SemiBin_TM:

awk '{if(length($3)>8){print $1"_TM\t"$2"\t"$3}else{print}}' S21.contigs_to_bin_R_change4.tsv > S21.contigs_to_bin_R_change5.tsv
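To see whether any bin IDs are still shared between binners, something along these lines could be used. It is a rough sketch under the assumption that column 1 holds the bin ID and column 3 the binner label; find_shared_bin_ids is a hypothetical name, and the demo rows are made up for illustration.

```shell
# Hypothetical helper (not part of MAGScoT): list bin IDs (column 1) that
# occur together with more than one binner label (column 3).
find_shared_bin_ids() {
    awk -F'\t' '
        # count each distinct (bin ID, binner) pair once per bin ID
        !seen[$1 FS $3]++ { n[$1]++ }
        END { for (b in n) if (n[b] > 1) print b }
    ' "$1" | sort
}

# Demo on a throwaway file: the same bin name is used by two binner labels.
demo=$(mktemp)
printf 'SemiBin_0.fa\tcontig_1\tSemiBin\n'           > "$demo"
printf 'SemiBin_0.fa\tcontig_2\tSemiBin_TM\n'       >> "$demo"
printf 'METABAT__1-contigs.fa\tcontig_3\tMetabat\n' >> "$demo"
find_shared_bin_ids "$demo"
rm -f "$demo"
```

An empty result means every bin ID belongs to exactly one binner label, which is the situation MAGScoT expects.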

S21.contigs_to_bin_R_change5.tsv.zip

Using this as input, MAGScoT runs through and creates 61 bins with scores > 0.5 using MAGScoT's default scoring:

Rscript MAGScoT.R -i S21.contigs_to_bin_R_change5.tsv --hmm S21.hmm -o S21_new2 > S21_new2.log

I attached all outputs to this message: S21.contigs_to_bin_R_change5.tsv.zip

JSSaini commented 5 months ago

@mruehlemann @eikematthias Indeed, I now see the errors I made while creating the tab-delimited file. Thank you for helping with the troubleshooting and providing the correct steps. Perhaps it would be beneficial to include a section in the documentation explaining how users can prepare a MAGScoT-compatible tab-delimited file once they have FASTA files from multiple binning outputs. I will close this thread. Thank you. Kind regards, Jaspreet
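For illustration, such a file could probably be generated directly from the bin FASTA files rather than edited by hand. The sketch below is an assumption-laden example, not the documented MAGScoT procedure: fasta_dir_to_tsv is a hypothetical helper, it assumes one directory of *.fa files per binner, and it uses the column order seen earlier in this thread (bin, contig, binner).

```shell
# Hypothetical helper (not part of MAGScoT): print "bin<TAB>contig<TAB>binner"
# for every sequence header in every *.fa file of one directory.
# Usage: fasta_dir_to_tsv <directory> <binner_label>
fasta_dir_to_tsv() {
    for fa in "$1"/*.fa; do
        [ -e "$fa" ] || continue            # skip if the glob matched nothing
        awk -v bin="$(basename "$fa")" -v binner="$2" '
            # keep only the first word of each header, without the leading ">"
            /^>/ { sub(/^>/, "", $1); print bin "\t" $1 "\t" binner }
        ' "$fa"
    done
}

# Demo with a throwaway directory holding one small bin file.
work=$(mktemp -d)
mkdir "$work/metabat"
printf '>contig_1 len=4\nACGT\n>contig_2\nGGCC\n' > "$work/metabat/bin_1.fa"
fasta_dir_to_tsv "$work/metabat" Metabat
rm -rf "$work"
```

Running the helper once per binner and concatenating the results would give a single tab-delimited file. If two binners produce identical file names, a suffix would still need to be added to column 1 so that bin IDs stay unique.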

ikmb / MAGScoT

dplyr error in R script #6