Open AlessioMilanese opened 2 years ago
Seconded. Am having the same problem, and was about to post : )
Note: I realised that you need reads with identifiers ending with .1
and .2
.
I used this command to have correct reads:
MarkerMAG rename_reads -r1 ETHSEQ0000005220.1.fq -r2 ETHSEQ0000005220.2.fq -p test -fq -t 4
And then run MarkerMAG on the new files, but I get the same error. I also tried to run directly with fasta file (instead of fastq) and get the same error.
Hi AlessioMilanese,
Thanks for using MarkerMAG, a new release (1.1.27) is available now, can you please update and see if the error still exist? Weizhi
Hi @songweizhi
Thanks for checking and solving the problem. Seems to be working good now. For your information, this is what I got in the stdout:
[2022-07-04 09:43:43] parameters for linking
+ mismatch: 2%
+ min_M_len: 45bp
+ min_M_pct: 35%
+ min_link_num_gnm: 9
+ min_link_num_ctg: 3
+ rd2_end_seq_len: 1000bp
+ max_short_cigar_pct: 75,85
[2022-07-04 09:43:43] parameters for estimating copy number
+ MAG_cov_subsample_pct: 25%
+ min_insert_size_16s: -1000bp
+ ignore_ends_len_16s: 150bp
+ ignore_lowest_pct: 25%
+ ignore_highest_pct: 25%
+ both_pair_mapped: False
[2022-07-04 09:43:43] Rd1: identifying 16S rRNA genes in input MAGs with barrnap
[2022-07-04 09:43:52] Rd1: identify 16S rRNA genes in input MAGs finished
[2022-07-04 09:43:52] Rd1: removing 16S sequences at the end of MAG contigs
[2022-07-04 09:43:53] Rd1: remove 16S sequences at the end of MAG contigs finished
[2022-07-04 09:43:53] Rd1: quality control provided 16S rRNA gene sequences to:
[2022-07-04 09:43:53] Rd1: remove non-16S sequences (if any)
[2022-07-04 09:43:53] Rd1: cluster at 99% identity and keep only the longest one in each cluster
[2022-07-04 09:43:57] Rd1: qualified 16S rRNA gene sequences exported to: reconstructed_16_polished_min1200bp_c99.fa
[2022-07-04 09:44:05] Rd1: mapping input reads to marker genes with 8 cores (be patient!)
[2022-07-04 09:44:28] Rd1: sorting test_input_reads_to_16S.sam
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 09:44:32] Rd1: calculating the number of lines per subset
[2022-07-04 09:44:33] Rd1: splitting sam file
[2022-07-04 09:44:35] Rd1: analysing mappping results with 8 threads
[2022-07-04 09:44:37] Rd1: removing splitted subsets from disk
[2022-07-04 09:44:37] Rd1: reading filtered alignments into dict
[2022-07-04 09:44:38] Rd1: extracting sequences of reads matched to 16S
[2022-07-04 09:44:41] Rd1: mapping extracted reads to input genomes
[2022-07-04 09:45:05] Rd1: analysing mappping results
[2022-07-04 09:45:05] Rd1: processed 0.07k
[2022-07-04 09:45:05] Rd1: parsing MappingRecord dict to get linkages
[2022-07-04 09:45:05] Rd1: calculating pairwise 16S rRNA gene identities
[2022-07-04 09:45:11] Rd1: filtering linkages iteratively
[2022-07-04 09:45:11] Rd1: extracting linking reads for visualization
[2022-07-04 09:45:13] Rd1: visualizing 0 rd1 linkages with 8 threads
[2022-07-04 09:45:13] Rd2: get unlinked marker genes and genomes
[2022-07-04 09:45:13] Rd2: mapping input reads to the ends of contigs from unlinked genomes
[2022-07-04 09:45:59] Rd2: sorting mappping results
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 09:46:01] Rd2: calculating the number of lines per subset
[2022-07-04 09:46:02] Rd2: splitting sam file
[2022-07-04 09:46:04] Rd2: reading in mappping results with 8 threads
[2022-07-04 09:46:06] Rd2: removing splitted subsets from disk
[2022-07-04 09:46:10] Rd2: running SPAdes on extracted reads
[2022-07-04 09:46:17] Mini-assembly not found! will report 1st round linkages only!
[2022-07-04 09:46:17] Visualising linkages
[2022-07-04 09:46:18] Linking MAGs to 16S rRNA genes done!
[2022-07-04 09:46:18] Command for 16S copy number calculation exported to log file
[2022-07-04 09:46:18] Get_cn: running V-Xtractor on 16S rRNA genes
V-Xtractor v. 2.1. Copyright (c) Hartmann et al. 2010.
09:46:25 [==== ] 16.3% 09:47:20 Remaining: 0:00:46In sequence crc01+ETHSEQ0000005220+31, region V9 starts at 1390 and ends at 1236, which is backwards.
09:46:25 [===== ] 19.8% 09:47:20 Remaining: 0:00:44In sequence crc01+ETHSEQ0000005220+9, region V9 starts at 1384 and ends at 87, which is backwards.
09:46:25 [====== ] 22.1% 09:47:19 Remaining: 0:00:42In sequence crc01+ETHSEQ0000005220+42, region V9 starts at 1384 and ends at 721, which is backwards.
09:46:25 [========== ] 34.3% 09:47:23 Remaining: 0:00:38In sequence crc01+ETHSEQ0000005220+65, region V9 starts at 1272 and ends at 1059, which is backwards.
09:46:25 [====================== ] 75.7% 09:47:21 Remaining: 0:00:13In sequence crc01+ETHSEQ0000005220+72, region V9 starts at 1395 and ends at 442, which is backwards.
09:46:25 [==============================] 100% 09:47:21 Total: 0:00:56
[2022-07-04 09:47:22] Get_cn: subsampling reads for MAG coverage estimation
[2022-07-04 09:47:28] Get_cn: mapping reads to combined prefixed MAGs
Error: reads file does not look like a FASTA file
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)
[2022-07-04 09:47:42] Get_cn: removing high mismatch alignments
[2022-07-04 09:47:43] Get_cn: removing singletons
[2022-07-04 09:47:43] Get_cn: sorting filtered sam file
[2022-07-04 09:47:43] Get_cn: getting depth with samtools
[2022-07-04 09:47:43] Get_cn: reading in MAG sequences
[2022-07-04 09:47:43] Get_cn: reading in Barrnap outputs
[2022-07-04 09:47:43] Get_cn: reading in depth file
[2022-07-04 09:47:43] Get_cn: getting Coverage and GC bias for 0 Genomes with 8 cores
[2022-07-04 09:47:43] Get_cn: get Coverage and GC bias done
cat: output_test/test_get_16S_cp_num_wd/test_MAG_depth_GC_content/*.txt: No such file or directory
[2022-07-04 09:47:43] Get_cn: splitting sorted sam file
[2022-07-04 09:47:46] Get_cn: parsing alignments of reads to all 16S with multiple threads
[2022-07-04 09:47:48] Get_cn: parsing alignments of reads to linked 16S with multiple threads
[2022-07-04 09:47:48] Get_cn: filtering sam file for all 16S
[2022-07-04 09:47:51] Get_cn: filtering sam file for linked 16S
[2022-07-04 09:47:52] Get_cn: sorting filtered sam file by contig id
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 09:47:52] Get_cn: getting depth directly from sam file (all 16S)
[2022-07-04 09:47:52] Get_cn: getting depth directly from sam file (linked 16S)
[2022-07-04 09:47:52] Get_cn: estimated copy number exported to test_copy_num_by_MAG.txt
[2022-07-04 09:47:52] Get_cn: removing tmp files
[2022-07-04 09:47:52] Get_cn: Done!
[2022-07-04 09:47:52] Estimated copy number 16S rRNA genes exported to:
[2022-07-04 09:47:52] output_test/test_copy_num_by_16S.txt
[2022-07-04 09:47:52] output_test/test_copy_num_by_MAG.txt
[2022-07-04 09:47:52] All done!
[2022-07-04 09:47:52] In summary:
1. Quality filtered 16S exported to: reconstructed_16_polished_min1200bp_c99.fa
2. Identified linkages exported to: test_linkages_by_contig/genome.txt
3. Linking profiles exported to: test_linkage_visualization_rd1/2
4. Estimated copy number exported to: test_copy_num_by_16S/MAG.txt
Unfortunately most of the results seems to be empty.
These files have only the header: test_copy_num_by_16S.txt
, test_copy_num_by_MAG.txt
, test_linkages_by_contig.txt
, test_linkages_by_genome.txt
.
I see:
Error: reads file does not look like a FASTA file
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)
If I try to transform the input to fasta, seems there is no more error:
[2022-07-04 10:23:22] parameters for linking
+ mismatch: 2%
+ min_M_len: 45bp
+ min_M_pct: 35%
+ min_link_num_gnm: 9
+ min_link_num_ctg: 3
+ rd2_end_seq_len: 1000bp
+ max_short_cigar_pct: 75,85
[2022-07-04 10:23:22] parameters for estimating copy number
+ MAG_cov_subsample_pct: 25%
+ min_insert_size_16s: -1000bp
+ ignore_ends_len_16s: 150bp
+ ignore_lowest_pct: 25%
+ ignore_highest_pct: 25%
+ both_pair_mapped: False
[2022-07-04 10:23:22] Rd1: identifying 16S rRNA genes in input MAGs with barrnap
[2022-07-04 10:23:30] Rd1: identify 16S rRNA genes in input MAGs finished
[2022-07-04 10:23:30] Rd1: removing 16S sequences at the end of MAG contigs
[2022-07-04 10:23:30] Rd1: remove 16S sequences at the end of MAG contigs finished
[2022-07-04 10:23:30] Rd1: quality control provided 16S rRNA gene sequences to:
[2022-07-04 10:23:30] Rd1: remove non-16S sequences (if any)
[2022-07-04 10:23:30] Rd1: cluster at 99% identity and keep only the longest one in each cluster
[2022-07-04 10:23:35] Rd1: qualified 16S rRNA gene sequences exported to: reconstructed_16_polished_min1200bp_c99.fa
[2022-07-04 10:23:37] Rd1: mapping input reads to marker genes with 8 cores (be patient!)
[2022-07-04 10:24:00] Rd1: sorting test_input_reads_to_16S.sam
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 10:24:02] Rd1: calculating the number of lines per subset
[2022-07-04 10:24:02] Rd1: splitting sam file
[2022-07-04 10:24:04] Rd1: analysing mappping results with 8 threads
[2022-07-04 10:24:06] Rd1: removing splitted subsets from disk
[2022-07-04 10:24:06] Rd1: reading filtered alignments into dict
[2022-07-04 10:24:07] Rd1: extracting sequences of reads matched to 16S
[2022-07-04 10:24:09] Rd1: mapping extracted reads to input genomes
[2022-07-04 10:24:35] Rd1: analysing mappping results
[2022-07-04 10:24:35] Rd1: processed 0.07k
[2022-07-04 10:24:35] Rd1: parsing MappingRecord dict to get linkages
[2022-07-04 10:24:35] Rd1: calculating pairwise 16S rRNA gene identities
[2022-07-04 10:24:37] Rd1: filtering linkages iteratively
[2022-07-04 10:24:37] Rd1: extracting linking reads for visualization
[2022-07-04 10:24:39] Rd1: visualizing 0 rd1 linkages with 8 threads
[2022-07-04 10:24:40] Rd2: get unlinked marker genes and genomes
[2022-07-04 10:24:40] Rd2: mapping input reads to the ends of contigs from unlinked genomes
[2022-07-04 10:25:23] Rd2: sorting mappping results
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 10:25:26] Rd2: calculating the number of lines per subset
[2022-07-04 10:25:26] Rd2: splitting sam file
[2022-07-04 10:25:28] Rd2: reading in mappping results with 8 threads
[2022-07-04 10:25:30] Rd2: removing splitted subsets from disk
[2022-07-04 10:25:34] Rd2: running SPAdes on extracted reads
[2022-07-04 10:25:46] Mini-assembly not found! will report 1st round linkages only!
[2022-07-04 10:25:46] Visualising linkages
[2022-07-04 10:25:47] Linking MAGs to 16S rRNA genes done!
[2022-07-04 10:25:47] Command for 16S copy number calculation exported to log file
[2022-07-04 10:25:47] Get_cn: running V-Xtractor on 16S rRNA genes
V-Xtractor v. 2.1. Copyright (c) Hartmann et al. 2010.
10:25:53 [===== ] 17.4% 10:26:50 Remaining: 0:00:47In sequence crc01+ETHSEQ0000005220+31, region V9 starts at 1390 and ends at 1236, which is backwards.
10:25:53 [===== ] 18.6% 10:26:51 Remaining: 0:00:47In sequence crc01+ETHSEQ0000005220+9, region V9 starts at 1384 and ends at 87, which is backwards.
10:25:53 [====== ] 22.1% 10:26:51 Remaining: 0:00:45In sequence crc01+ETHSEQ0000005220+42, region V9 starts at 1384 and ends at 721, which is backwards.
10:25:53 [========== ] 34.3% 10:26:51 Remaining: 0:00:38In sequence crc01+ETHSEQ0000005220+65, region V9 starts at 1272 and ends at 1059, which is backwards.
10:25:53 [====================== ] 75.7% 10:26:49 Remaining: 0:00:13In sequence crc01+ETHSEQ0000005220+72, region V9 starts at 1395 and ends at 442, which is backwards.
10:25:53 [==============================] 100% 10:26:49 Total: 0:00:56
[2022-07-04 10:26:50] Get_cn: subsampling reads for MAG coverage estimation
[2022-07-04 10:26:54] Get_cn: mapping reads to combined prefixed MAGs
[2022-07-04 10:27:22] Get_cn: removing high mismatch alignments
[2022-07-04 10:27:26] Get_cn: removing singletons
[2022-07-04 10:27:31] Get_cn: sorting filtered sam file
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 10:27:36] Get_cn: getting depth with samtools
[2022-07-04 10:27:46] Get_cn: reading in MAG sequences
[2022-07-04 10:27:46] Get_cn: reading in Barrnap outputs
[2022-07-04 10:27:46] Get_cn: reading in depth file
[2022-07-04 10:28:14] Get_cn: getting Coverage and GC bias for 0 Genomes with 8 cores
[2022-07-04 10:28:15] Get_cn: get Coverage and GC bias done
cat: output_test_fasta/test_get_16S_cp_num_wd/test_MAG_depth_GC_content/*.txt: No such file or directory
[2022-07-04 10:28:15] Get_cn: splitting sorted sam file
[2022-07-04 10:28:18] Get_cn: parsing alignments of reads to all 16S with multiple threads
[2022-07-04 10:28:21] Get_cn: parsing alignments of reads to linked 16S with multiple threads
[2022-07-04 10:28:22] Get_cn: filtering sam file for all 16S
[2022-07-04 10:28:24] Get_cn: filtering sam file for linked 16S
[2022-07-04 10:28:25] Get_cn: sorting filtered sam file by contig id
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[2022-07-04 10:28:25] Get_cn: getting depth directly from sam file (all 16S)
[2022-07-04 10:28:25] Get_cn: getting depth directly from sam file (linked 16S)
[2022-07-04 10:28:25] Get_cn: estimated copy number exported to test_copy_num_by_MAG.txt
[2022-07-04 10:28:25] Get_cn: removing tmp files
[2022-07-04 10:28:25] Get_cn: Done!
[2022-07-04 10:28:27] Estimated copy number 16S rRNA genes exported to:
[2022-07-04 10:28:27] output_test_fasta/test_copy_num_by_16S.txt
[2022-07-04 10:28:27] output_test_fasta/test_copy_num_by_MAG.txt
[2022-07-04 10:28:27] All done!
[2022-07-04 10:28:27] In summary:
1. Quality filtered 16S exported to: reconstructed_16_polished_min1200bp_c99.fa
2. Identified linkages exported to: test_linkages_by_contig/genome.txt
3. Linking profiles exported to: test_linkage_visualization_rd1/2
4. Estimated copy number exported to: test_copy_num_by_16S/MAG.txt
But the result files are still empty.
The input fasta looks like:
$ head test_R1.fasta
>test_1.1
GATCTCTTTCCACATCTACAATGCAGGTTATGTTGAAGTTGGAAACATTTTCTTTCATCATAGGTAAGAGTTCCTGCGGAAAATAGAAA
>test_2.1
CAATAAGGCTCAGCGGAATACACATCATCACCATGATGGACAGCCGGGGGG
>test_3.1
AAGATGTAATACTCATAAGGGACTTCCTCGGTGGTGGTTTCCCCGGTTTCCGGGTCGGTACTTGTTTCGGTGCGGT
>test_4.1
CCGCAAAGGTCAACGGAACGGTAACGGAAATCGCACGAGATAAGACAAAAACAAAGTCGAGCTGTCGGACACTTCCCCTAATCCCTGCCTGTGAG
>test_5.1
ATCGAGCTAATCCAACCAGCAAGTGCTTGACCAGACATCCCGATATTGAAGAGACCTACTTTCATTGCTACCGCAAAAGATAAAGC
The input fasta file looks all good. Do you have any idea on the coverage of the input MAGs? For a MAG to be linked to 16S rRNA gene, a sequencing depth of 20x is normally needed.
I see, is there a way to lower this threshold? I will also check the average base coverage and try other samples!
Update:
With the new version (updated with conda*), things worked for me.
I've tried it on a varied group of metagenomic samples. Got nearly all of the bins matched with 16S in a low-diversity enriched culture. More limited for datasets with metagenomes with limited sequence data (few bins) and for some high-diversity metagenomes, but still, very impressive! Round 2 got a couple of additional results for the rare metagenome which was both high-depth and highly diverse. In terms of resources, I got most of the work done with 12 cores, 1GB each, except for a larger metagenome (~50Gbases) that required ~40-50 GB of memory in total.
Thank you for the useful software!
*For some reason, upgrading with conda automatically downgraded bowtie2 to a version that MarkerMag was no longer compatible with. I could upgrade bowtie2 again with conda and that worked, it was just unusual. Not sure if it's just my setup.
Hi
I'm using the latest version, but I also have this problem, the bowtie2 version is not compatible and there are errors during the runtime.
Best, xnw
Hi,
Thanks for the nice tool. When I try to run this command:
I have the following error:
And in the log file (
test.log
) I see: