kevinmneal closed this issue 5 years ago
I'll test it out now and report back
When I run faststructure with a file written by write_faststructure, I get this error:
Traceback (most recent call last):
File "/mnt/Data5/kevinRAD/RAD_2018-02-01/ipyrad_min100/highqual_min220_outfiles/faststructure/fastStructure/structure.py", line 172, in <module>
G = parse_str.load(params['inputfile'])
File "parse_str.pyx", line 10, in parse_str.load
L = loci.shape[1]
IndexError: tuple index out of range
Also, I have to change the file suffix to ".str" or else faststructure won't read it.
The faststructure output: OCreduced148.radiator.75pctmsng.oneSNPmac3.faststructure.txt
Actually, I checked my other faststructure files; they don't have the locus name header. I just removed the locus name header (the entire first line), and it's running now.
I removed the top line (but it might have been something else; the top line was just markers, not accounting for the first 6 columns...).
genomic_converter should work too.
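For anyone hitting the same IndexError with an older file, the manual workaround described above (drop the marker-name line, save with a .str suffix) can be sketched in Python. This is only an illustration, not part of radiator or fastStructure: the function name is hypothetical, and the header test assumes the first line has a different number of whitespace-separated fields than the genotype rows, as reported above.

```python
from pathlib import Path


def strip_header_for_faststructure(src, dst=None):
    """Copy src to a .str file, dropping a leading marker-name line if present.

    Assumption: a header is recognised only because its field count differs
    from that of the first genotype row.
    """
    src = Path(src)
    # Default output: same name with the last suffix swapped for .str
    dst = Path(dst) if dst is not None else src.with_suffix(".str")
    lines = src.read_text().splitlines(keepends=True)
    has_header = (
        len(lines) >= 2
        and len(lines[0].split()) != len(lines[1].split())
    )
    dst.write_text("".join(lines[1:] if has_header else lines))
    return dst
```

Calling it on a file named like something.faststructure.txt writes something.faststructure.str next to it and leaves the original untouched.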
Hi Thierry
Trying out the latest version (updated just an hour ago, from the version you posted 6 days ago that added faststructure). I'm getting errors with genomic_converter for genepop, hierfstat, and structure: Error in .f(.x[[i]], ...) : object 'GT' not found
My hunch is that it's because my strata again has populations with only 1 individual (I'd removed them in the previous dataset, but haven't done so for the dataset I'm using here; I'll test it by adding the single individuals to a larger population). UPDATE: still get the errors.
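To rule the single-individual hunch in or out quickly, the strata file can be scanned for populations with exactly one sample. A minimal sketch, assuming the strata file is tab-separated with a header row and a column named STRATA; the function name is hypothetical, and the column name should be adjusted if the file differs.

```python
import csv
from collections import Counter


def singleton_strata(strata_path, strata_col="STRATA"):
    """Return the sorted names of strata that contain exactly one individual.

    Assumption: tab-separated file with a header row; strata_col names the
    population column.
    """
    with open(strata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = Counter(row[strata_col] for row in reader)
    return sorted(name for name, n in counts.items() if n == 1)
```

An empty list means no stratum has a single individual, so the error would have to come from somewhere else.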
Also, a different error when writing betadiv and plink:
Generating betadiv object
Calibrating REF/ALT alleles...
number of REF/ALT switch = 9
Error in .f(.x[[i]], ...) : object 'POP_ID' not found
Full function call and return below. Note this is a slightly different dataset than what I've already sent you (different individuals over a wider geographic range, but sequencing and assembly were the same)
> genomic_converter(data = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad",
+ strata = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt",
+ verbose = TRUE,
+ filename="sphasouth178spatial.radiator.50pctmsng.oneSNPmac3",
+ parallel.core = 1,
+ fig.upsetr=TRUE,
+ filter.common.markers=FALSE,
+ output=c("vcf", "fineradstructure", "tidy", "plink", "ldna", "stockr", "genind", "genlight", "structure", "bayescan", "betadiv", "related"))
################################################################################
########################## radiator::genomic_converter #########################
################################################################################
Execution date@time: 20190329@1757
genomic_converter function call arguments:
data = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad
strata = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt
output = vcf, fineradstructure, tidy, plink, ldna, stockr, genind, genlight, structure, bayescan, betadiv, related
filename = sphasouth178spatial.radiator.50pctmsng.oneSNPmac3
parallel.core = 1
verbose = TRUE
dots-dots-dots ... arguments
Arguments inside "..." assigned in genomic_converter:
filter.common.markers = FALSE
Default "..." arguments assigned in genomic_converter:
blacklist.genotypes = NULL
filter.monomorphic = TRUE
internal = FALSE
keep.allele.names = FALSE
parameters = NULL
path.folder = NULL
vcf.metadata = TRUE
vcf.stats = TRUE
whitelist.markers = NULL
Unknowned arguments identified inside "...":
fig.upsetr
Read documentation, for latest changes, and modify your codes!
Folder created: 21_radiator_genomic_converter_20190329@1757
File written: radiator_genomic_converter_args_20190329@1757.tsv
Filters parameters file generated: filters_parameters_20190329@1757.tsv
Importing data
Synchronizing data and strata...
Number of strata: 31
Number of individuals: 178
Writing tidy data set:
sphasouth178spatial.radiator.50pctmsng.oneSNPmac3.rad
Preparing data for output
Data is bi-allelic
Generating structure file
Error in .f(.x[[i]], ...) : object 'GT' not found
Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):
##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
1 locus_100029__100029__50 NA C T . PASS NS=115 GT
1 locus_100077__100077__49 NA A G . PASS NS=174 GT
1 locus_10009__10009__11 NA C T . PASS NS=178 GT
1 locus_100137__100137__36 NA G T . PASS NS=163 GT
1 locus_100170__100170__22 NA T A . PASS NS=102 GT
1 locus_100203__100203__29 NA C T . PASS NS=108 GT
This is what the input vcf looks like:
##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
locus_100006 5 . G A 13 PASS NS=120;DP=2002 GT:DP:CATG
locus_100014 8 . G T 13 PASS NS=251;DP=8028 GT:DP:CATG
locus_100023 12 . C A 13 PASS NS=95;DP=2406 GT:DP:CATG
locus_100029 50 . C T 13 PASS NS=173;DP=3987 GT:DP:CATG
locus_100038 19 . T C 13 PASS NS=261;DP=9508 GT:DP:CATG
locus_100042 50 . C A 13 PASS NS=91;DP=887 GT:DP:CATG
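A side-by-side check of the two files may make the difference easier to report. The sketch below pulls out the declared FORMAT IDs, the number of sample columns after FORMAT, and the CHROM/POS/ID of the first record. It is an illustration only: the function name is hypothetical, and it assumes a tab-delimited, uncompressed VCF.

```python
def summarize_vcf(path):
    """Summarise the header and first record of an uncompressed VCF.

    Assumption: standard tab-delimited layout with nine fixed columns
    (#CHROM through FORMAT) followed by one column per sample.
    """
    format_ids = []
    samples = []
    first_record = None
    prefix = "##FORMAT=<ID="
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(prefix):
                # Keep only the ID, e.g. GT, DP, CATG
                format_ids.append(line[len(prefix):].split(",", 1)[0])
            elif line.startswith("#CHROM"):
                samples = line.split("\t")[9:]
            elif line and not line.startswith("#"):
                # CHROM, POS, ID of the first data line
                first_record = line.split("\t")[:3]
                break
    return {
        "format_ids": format_ids,
        "n_samples": len(samples),
        "first_record": first_record,
    }
```

Running it on both the input and the output VCF and comparing the two dictionaries shows at a glance whether CHROM, POS and ID were shifted and which FORMAT fields survived.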
I’ll have a look at it in 1 hour
It's the weekend, no rush! (unless you want to)
Thanks
Working non-stop for 2 weeks. The workshop is next week, got to have radiator ready ;) https://thierrygosselin.github.io/genomics-workshops/
Hi Thierry, Any updates on the next version of radiator coming out? Running into some of the same problems stated above using filter_rad on DArT data:
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1458\01_radiator\radiator_20190418@1458.gds.rd'. The process cannot access the file because it is being used by another process.
Also:
In grid.Call(C_textBounds, as.graphicsAnnot(x$label), ... : font family not found in Windows font database
Thanks!
Chase
I’m sick and won’t be able to work or push anything before Monday next week.
Some of your problems seem to be related to parallel processing. Try setting parallel.core to 1. Not being able to access the file seems to be another problem, though: 1) make sure you have read/write permission 2) why is the file ending with ‘gds.rd’? It should be automatically set to ‘gds.rad’, weird...
The other one about font family is new to me. I removed all mention of font family so that Windows machines wouldn’t have to struggle with the Helvetica font, so it should pick whatever you have as default.
Have you made ggplot figures before? Did you get similar errors?
Hey Thierry,
Thanks for the quick response! Hope you feel better soon.
-Setting to 1 core seemed to fix the errors, even the font family one. I've never had that error with ggplot either, so that's strange.
-Files are saving as .rad, not .rd; something must have gotten deleted when I was copying it over.
I'll let you know if I run in to anything else.
How many cores do you have on your PC?
8 cores on my PC
For some functions, the SeqArray package has some problems dealing with parallel processing on PC (because of the lack of forking, it uses something else). So for those I’m detecting the OS and setting it automatically.
If you encounter another problem let me know exactly when it happens in the pipeline
8 cores shouldn’t be any problem
If I specify more than 1 core I get an error during the first common marker filter:
Scanning for common markers...
Generating UpSet plot to visualize markers in common
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1527\01_radiator\radiator_20190418@1527.gds.rad'. The process cannot access the file because it is being used by another process
If I specify 1 core I get an error at the MAC filter step:
Error in SeqArray::seqGetData(gds, "annotation/format/AD") : The GDS node "annotation/format/AD/data" does not exist.
New push on GitHub, Try again and let me know if it works or not
That worked! Thanks for all the help.
Hi Kevin, I'm closing the issue, because I think everything was covered. If I missed something, let me know by reopening it.
Hi Thierry,
Hadn't used filter_rad/genomic_converter in a while, but I reran filter_rad today with the latest version of radiator and still get the error that I brought up a month or so ago, in the genomic_converter step of filter_rad (output below). Any ideas?
Transferring data to genomic converter...
Synchronizing data and strata...
Number of strata: 1
Number of individuals: 153
Writing tidy data set:
sphasouth153spatial.nohybrids.radiator.50pctmsng.oneSNPmac3.rad
Calibrating REF/ALT alleles...
number of REF/ALT switch = 10
Calibrating REF/ALT alleles...
number of REF/ALT switch = 10
Error in .f(.x[[i]], ...) : object 'GT' not found
Hi Thierry,
A few small bugs I've noticed with filter_rad() in interactive mode:
-Any filter that removes samples/individuals causes failure in the next filtering step:
-detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:
-May need to remove strata=NULL from filter_hwe in filter_rad():
-when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.
-in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.
-Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.
-purely aesthetic, but when running on Windows, the font choice in the plots (Helvetica?) causes warnings:
Not issues, but questions/suggestions/requests:
-Are there better explanations for how outliers/q75/iqr are calculated and applied? Is outliers just outside 95% CI?
-filter_coverage step returns a plot of max mean coverage; a plot for min mean would be useful
-In filter_genotyping, is the threshold applied per-strata or only on the total?
-Long LD filtering appears to work but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering?
-outputting the full function call with args entered during the interactive session would help with reproducibility
-asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time