Closed: cookiemonsterxxm closed this issue 4 years ago
Hello - I've fixed this issue. The problem was that only 1 of the marker genes in the database had mapped reads. I've now added a check that at least 2 genes have mapped reads; if not, the program exits with an error. Note that AGS estimates are only accurate with at least a few hundred thousand reads (please see the paper for more details). Your input file had too few reads for a meaningful and accurate AGS estimate.
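For readers curious what such a check looks like, here is a minimal sketch of the safeguard described above. The names (count_genes_with_hits, check_enough_genes, MIN_GENES) are illustrative, not MicrobeCensus's actual identifiers.

```python
import sys

MIN_GENES = 2  # the new requirement: at least 2 marker genes with mapped reads

def count_genes_with_hits(hits_per_gene):
    """hits_per_gene: dict mapping marker-gene id -> number of mapped reads."""
    return sum(1 for n in hits_per_gene.values() if n > 0)

def check_enough_genes(hits_per_gene):
    """Exit with an error if too few marker genes received mapped reads."""
    n = count_genes_with_hits(hits_per_gene)
    if n < MIN_GENES:
        sys.exit("Error: only %d marker gene(s) had mapped reads; "
                 "at least %d are required for an AGS estimate" % (n, MIN_GENES))
    return n

# Two genes with hits -> passes the check and returns 2.
print(check_enough_genes({'geneA': 5, 'geneB': 3, 'geneC': 0}))
# One gene with hits -> would exit with an error:
# check_enough_genes({'geneA': 5, 'geneB': 0})
```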
Hello! Thank you for developing this wonderful tool, and for also monitoring this GitHub so closely -- your answers/advice in other issues have been very helpful to me!
I thought I would add to this issue, as I am also receiving "TypeError: 'NoneType' object is not iterable" when attempting to run MicrobeCensus with a for-loop:
[hansenzo@dev-intel18 ~]$ python MicrobeCensus/microbe_census_script.py
MicrobeCensus - estimation of average genome size from shotgun sequence data
version 1.1.0; github.com/snayfach/MicrobeCensus
Copyright (C) 2013-2015 Stephen Nayfach
Freely distributed under the GNU General Public License (GPLv3)
Not a gzipped file (b'@H')
Traceback (most recent call last):
  File "MicrobeCensus/microbe_census_script.py", line 17, in <module>
    average_genome_size, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable
For reference, my 'microbe_census_script.py' looks like this:
#!/usr/bin/env python
from microbe_census import microbe_census

a = open("/mnt/home/hansenzo/test_sampleIDs_trimmed.txt")
a1 = a.read().splitlines()
for i in a1:
    args = {'seqfiles': ['/mnt/home/hansenzo/amrplusplus/test/' + i + '.non.host.R1.fastq.gz',
                         '/mnt/home/hansenzo/amrplusplus/test/' + i + '.non.host.R2.fastq.gz'],
            'threads': 2, 'verbose': True}
    average_genome_size, args = microbe_census.run_pipeline(args)
    count_bases = microbe_census.count_bases(args)
    genome_equivalents = count_bases / average_genome_size
    print('Average genome size:', average_genome_size)
    print('Base Count:', count_bases)
    print('Genome Equivalents:', genome_equivalents)
    f = open('/mnt/home/hansenzo/MicrobeCensus/mc_output_amrplusplus_Run1.txt', 'a+', encoding='utf-8')
    vals = i + ',' + str(average_genome_size) + ',' + str(count_bases) + ',' + str(genome_equivalents) + '\n'
    f.write(vals)
    f.close()
a.close()
Here, the 'test_sampleIDs_trimmed.txt' file contains the sample names (e.g. 'ER0043', 'ER0235', etc.), and I am attempting to iterate over these names to pull the R1 and R2 reads from a common directory, then append the results for all samples to a common output file ('f'). This script has worked for me in the past when run on individual samples, but I have not tested it with a loop until now. I definitely have much to learn in the way of Python, so if an improperly constructed loop is causing this TypeError, I'd love any advice or suggestions for correcting it and running MicrobeCensus with iteration.
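As a side note, the per-sample path construction in the loop above can be factored into a small helper, which makes it easy to inspect the exact file names being passed to the pipeline before running it. The directory and file-name pattern below are taken from the script above; build_args is a hypothetical helper name, and the dict keys match the ones the script already uses.

```python
import os

READ_DIR = '/mnt/home/hansenzo/amrplusplus/test/'

def build_args(sample_id, read_dir=READ_DIR, threads=2):
    """Return the args dict the script passes to run_pipeline for one sample."""
    r1 = os.path.join(read_dir, sample_id + '.non.host.R1.fastq.gz')
    r2 = os.path.join(read_dir, sample_id + '.non.host.R2.fastq.gz')
    return {'seqfiles': [r1, r2], 'threads': threads, 'verbose': True}

# Inspect the paths for one sample before looping over all of them:
print(build_args('ER0043')['seqfiles'])
```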
I should also note that I installed MicrobeCensus in the spring of 2019, but I believe my version number is consistent with the 'microbe_census.py' script on this GitHub (version 1.1.0). Could this TypeError be due to a discrepancy between my version and the changes you refer to in the comment above? In that case, would it be best to just redo the MicrobeCensus installation?
Thank you very much for your time and help!
Hello Zoe. From the error message it looks like your input files may not be properly compressed. Can you verify this? You might also try the same code but explicitly use python2. If those things don't work, you can try updating your code. Hope this helps!
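The "Not a gzipped file (b'@H')" line is Python's gzip module reporting that the file does not start with the gzip magic bytes \x1f\x8b; b'@H' is '@H', i.e. the first FASTQ header line of an uncompressed file that was merely renamed with a .gz extension. A quick way to verify this before running the pipeline (is_gzipped is an illustrative helper, not part of MicrobeCensus):

```python
import gzip

def is_gzipped(path):
    """True if the file starts with the gzip magic number \\x1f\\x8b."""
    with open(path, 'rb') as fh:
        return fh.read(2) == b'\x1f\x8b'

# Demo: one properly compressed file and one plain-text file with a .gz name.
with gzip.open('real.fastq.gz', 'wb') as fh:
    fh.write(b'@HWI-read1\nACGT\n+\nIIII\n')
with open('fake.fastq.gz', 'wb') as fh:
    fh.write(b'@HWI-read1\nACGT\n+\nIIII\n')

print(is_gzipped('real.fastq.gz'))  # True
print(is_gzipped('fake.fastq.gz'))  # False
```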
Hi @snayfach -- thank you for catching this! I feel so silly for not noticing the line in the error output about my files not being gzipped. They had the .gz extension but were not properly compressed, so this fixed my issue! It runs perfectly!
I have somewhat unrelated, more general questions, and I'm not sure this is the most appropriate place for them, but I was wondering about your recommendation not to perform quality filtering when using MicrobeCensus for normalization. Do the -q, -m, -d, and -u flags have to be specified to be included, or are they automatically included in the code with their default settings? I wasn't sure if I needed to somehow 'void' these parameters to avoid quality filtering, as I am using trimmed, host-removed reads that have already undergone filtering.
And, just to confirm: MicrobeCensus performs trimming to attain a consistent read length in order to estimate AGS, and this length is designated by the -l flag? In my test run, both of my samples were trimmed to 150 bp (which may be the median read length, since that is the default?), but many of my reads were shorter than this and were discarded (~300,000+). Is it best to maintain this longer cutoff, or should I lower the -l setting to a shorter read length to include some of these discarded reads? Or would a better solution be to increase the -n setting to sample as many reads as possible, rather than potentially including less informative reads in the AGS estimation?
Thank you, again, for your help and expertise!
> I have somewhat unrelated, more general questions, and I'm not sure this is the most appropriate place for them, but I was wondering about your recommendation not to perform quality filtering when using MicrobeCensus for normalization. Do the -q, -m, -d, and -u flags have to be specified to be included, or are they automatically included in the code with their default settings? I wasn't sure if I needed to somehow 'void' these parameters to avoid quality filtering, as I am using trimmed, host-removed reads that have already undergone filtering.
The default is to perform no quality filtering, so no need to change any parameters.
> And, just to confirm: MicrobeCensus performs trimming to attain a consistent read length in order to estimate AGS, and this length is designated by the -l flag? In my test run, both of my samples were trimmed to 150 bp (which may be the median read length, since that is the default?), but many of my reads were shorter than this and were discarded (~300,000+). Is it best to maintain this longer cutoff, or should I lower the -l setting to a shorter read length to include some of these discarded reads? Or would a better solution be to increase the -n setting to sample as many reads as possible, rather than potentially including less informative reads in the AGS estimation?
I would set the -n parameter very high to include the maximum number of reads and let MicrobeCensus choose the read length to use.
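To make the trade-off concrete, here is a toy illustration (not MicrobeCensus's actual code) of the trim-or-discard behavior discussed above: with a read-length cutoff l, reads shorter than l are discarded and longer reads are trimmed to exactly l, so lowering -l keeps more reads but each read carries less sequence.

```python
def trim_or_discard(reads, l):
    """Trim each read to length l, dropping reads shorter than l."""
    return [r[:l] for r in reads if len(r) >= l]

reads = ['A' * 150, 'A' * 120, 'A' * 151, 'A' * 90]
print(len(trim_or_discard(reads, 150)))  # 2 reads survive at l=150
print(len(trim_or_discard(reads, 100)))  # 3 reads survive at l=100
```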
Best, Stephen
Most of the time it worked well, but I have found a problem:
xmixu@bm3:~/PATH/TO/DIR$ run_microbe_census.py home/PATH/TO/DIR/SRR7280924.extendedFrags.fastq /home/PATH/TO/DIR/SRR7280924.flash_mc.txt -t 16 -n 100000000
integer division or modulo by zero
Traceback (most recent call last):
  File "/share/apps/bio/bio/bin/run_microbe_census.py", line 62, in <module>
    est_ags, args = microbe_census.run_pipeline(args)
TypeError: 'NoneType' object is not iterable
I looked into the code; the failure is in the function estimate_average_genome_size(args, paths, agg_hits). There, sum_weights equals 0, causing the error. Can you please help figure out what is going on? Thank you in advance!
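For anyone hitting the same pair of messages: the output above suggests that the ZeroDivisionError ("integer division or modulo by zero") is caught and printed inside the pipeline, which then returns None, and the caller's tuple unpacking of that None raises the unrelated-looking TypeError. A minimal sketch of that failure mode (toy_pipeline is illustrative, not the real run_pipeline):

```python
def toy_pipeline(sum_weights):
    """Divide by a weight sum; on failure, print the error and return None."""
    try:
        return 100 // sum_weights, {'ok': True}  # (AGS estimate, args)
    except ZeroDivisionError as e:
        print(e)      # prints "integer division or modulo by zero"
        return None   # swallowed error -> None propagates to the caller

try:
    est_ags, args = toy_pipeline(0)  # unpacking None into two names fails
except TypeError as e:
    print(e)  # the 'NoneType' unpacking error (exact wording varies by Python version)
```

So the TypeError is a symptom; the root cause is whatever leaves sum_weights at 0 (e.g. no marker-gene hits at all in the sampled reads).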
Best, Xinming