thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0

filter_rad issues #44

Closed kevinmneal closed 5 years ago

kevinmneal commented 5 years ago

Hi Thierry,

A few small bugs I've noticed with filter_rad() in interactive mode:

-Any filter that removes samples/individuals causes failure in the next filtering step:

Error: Column `MISSING_PROP` must be length 203 (the number of rows) or one, not 204

-detect_mixed_genomes doesn't abide by the parallel.core arg in filter_rad(); fixed by adding parallel.core = parallel.core in filter_rad:

gds <- detect_mixed_genomes(data = gds, interactive.filter = interactive.filter, 
    detect.mixed.genomes = detect.mixed.genomes, ind.heterozygosity.threshold = NULL, 
    parameters = filters.parameters, verbose = verbose, parallel.core = parallel.core,
    path.folder = wf, internal = FALSE) 

-May need to remove strata=NULL from filter_hwe in filter_rad():

gds <- filter_hwe(data = gds, interactive.filter = interactive.filter, 
    filter.hwe = filter.hwe, strata=NULL, hw.pop.threshold = hw.pop.threshold, 
    midp.threshold = midp.threshold, parallel.core = parallel.core, 
    parameters = filters.parameters, path.folder = wf, verbose = verbose, 
    internal = FALSE)

-when filtering by HWE, interactive mode doesn't always detect the asterisk inputs; I think this happened when I tried setting hw.pop.threshold equal to the number of pops, or it may happen when some strata are removed for having n < 10, but I don't remember exactly. I just tried to re-run on strata where none were removed and didn't get the error.

-in general, is there a way to exit the interactive mode? When the HWE filter couldn't detect my inputs, I had to restart the R session to get out.

-Transferring to genomic_converter requires doing the REF/ALT calibration again. Not a major issue but adds some time.

-purely aesthetic, but when running on Windows, the font choice in the plots (Helvetica?) causes warnings:

In grid.Call(C_textBounds, as.graphicsAnnot(x$label),  ... :
  font family not found in Windows font database

Not issues, but questions/suggestions/requests:

-Are there better explanations for how outliers/q75/iqr are calculated and applied? Is outliers just outside 95% CI?

-filter_coverage step returns a plot of max mean coverage; a plot for min mean would be useful

-In filter_genotyping, is the threshold applied per-strata or only on the total?

-Long LD filtering appears to work but only if pruned WITHOUT missing data statistics when CHROMs represent contigs; pruning with missing data statistics doesn't remove anything. Is there a reason for this? Actually I'm not sure loci are pruned either way. Is it possible to collapse the CHROMs down to a single CHROM to do the long LD filtering?

-outputting the full function call with args entered during the interactive session would help with reproducibility

-asking if you want to run a particular filtering step interactively, e.g. asking if you want to skip calculating HWE since it takes a long time

kevinmneal commented 5 years ago

I'll test it out now and report back

kevinmneal commented 5 years ago

When I run faststructure with a file written by write_faststructure, I get this error:

Traceback (most recent call last):
  File "/mnt/Data5/kevinRAD/RAD_2018-02-01/ipyrad_min100/highqual_min220_outfiles/faststructure/fastStructure/structure.py", line 172, in <module>
    G = parse_str.load(params['inputfile'])
  File "parse_str.pyx", line 10, in parse_str.load
    L = loci.shape[1]
IndexError: tuple index out of range

Also, I have to change the file suffix to ".str" or else faststructure won't read it.

The faststructure output: OCreduced148.radiator.75pctmsng.oneSNPmac3.faststructure.txt

kevinmneal commented 5 years ago

Actually, I checked my other faststructure files; they don't have the locus name header. I just removed the locus name header/the entire first line, and it's running now
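
A minimal base R sketch of that workaround. It assumes the only changes needed are the ones described above: dropping the first line (the locus-name header) and saving with a ".str" suffix. The helper name `strip_faststructure_header` and the example file contents are hypothetical, not part of radiator or fastStructure:

```r
# Hypothetical helper: drop the first line (the locus-name header) of a
# file and save the remaining lines under a ".str" extension.
strip_faststructure_header <- function(input, output = NULL) {
  lines <- readLines(input)
  if (length(lines) < 2L) {
    stop("Expected a header line followed by at least one genotype line.")
  }
  if (is.null(output)) {
    # Swap the last extension (e.g. ".txt") for ".str"
    output <- paste0(tools::file_path_sans_ext(input), ".str")
  }
  writeLines(lines[-1L], output)
  invisible(output)
}

# Made-up example file: one header line and two genotype lines
tmp <- tempfile(fileext = ".txt")
writeLines(
  c("locus_1 locus_2",
    "ind1 pop1 0 0 0 0 1 2",
    "ind1 pop1 0 0 0 0 1 1"),
  tmp
)
out <- strip_faststructure_header(tmp)
```

The original file is left untouched, so the two versions can be compared if fastStructure still complains.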

thierrygosselin commented 5 years ago

I removed the top line (but it might have been something else, the top line was just markers, not accounting for the first 6 columns...).

genomic_converter: should work too

kevinmneal commented 5 years ago

Hi Thierry

Trying out the latest version (updated just an hour ago, from the version you posted 6 days ago that added faststructure). I'm getting errors with genomic_converter for genepop, hierfstat, and structure:

Error in .f(.x[[i]], ...) : object 'GT' not found

My hunch is that it's because my strata again has populations with only 1 individual (I'd removed them in the previous dataset but haven't done so for the dataset I'm using here; I'll test it by adding the single individuals to a larger population).

UPDATE: still get the errors.

Also a different error when writing betadiv and plink:

Generating betadiv object
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 9
Error in .f(.x[[i]], ...) : object 'POP_ID' not found

Full function call and return below. Note this is a slightly different dataset than what I've already sent you (different individuals over a wider geographic range, but sequencing and assembly were the same)

> genomic_converter(data = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad", 
+                                                                   strata = "G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt",
+                                                                   verbose = TRUE,
+                                                                   filename="sphasouth178spatial.radiator.50pctmsng.oneSNPmac3",
+                                                                   parallel.core = 1,
+                                                                   fig.upsetr=TRUE,
+                                                                   filter.common.markers=FALSE,
+                                                                   output=c("vcf", "fineradstructure", "tidy", "plink", "ldna", "stockr", "genind", "genlight", "structure", "bayescan", "betadiv", "related"))
################################################################################
########################## radiator::genomic_converter #########################
################################################################################
Execution date@time: 20190329@1757

genomic_converter function call arguments:
    data = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial_radiator/radiator_testing/filter_rad_20190329@1730/13_filtered/sphasouth178spatial.radiator.75pctmsng.oneSNPmac3.rad
    strata = G:/My Drive/Illumina Sequencing Data/20181212_rangewide/sphasouth178spatial/sphasouth178spatial_popfile_coords_radiatorstrata_allpops.txt
    output = vcf, fineradstructure, tidy, plink, ldna, stockr, genind, genlight, structure, bayescan, betadiv, related
    filename = sphasouth178spatial.radiator.50pctmsng.oneSNPmac3
    parallel.core = 1
    verbose = TRUE

dots-dots-dots ... arguments

Arguments inside "..." assigned in genomic_converter:
    filter.common.markers = FALSE

Default "..." arguments assigned in genomic_converter:
    blacklist.genotypes = NULL
    filter.monomorphic = TRUE
    internal = FALSE
    keep.allele.names = FALSE
    parameters = NULL
    path.folder = NULL
    vcf.metadata = TRUE
    vcf.stats = TRUE
    whitelist.markers = NULL

Unknowned arguments identified inside "...": 
    fig.upsetr

Read documentation, for latest changes, and modify your codes!

Folder created: 21_radiator_genomic_converter_20190329@1757
File written: radiator_genomic_converter_args_20190329@1757.tsv
Filters parameters file generated: filters_parameters_20190329@1757.tsv

Importing data

Synchronizing data and strata...
    Number of strata: 31
    Number of individuals: 178

Writing tidy data set:
sphasouth178spatial.radiator.50pctmsng.oneSNPmac3.rad

Preparing data for output

Data is bi-allelic
Generating structure file
Error in .f(.x[[i]], ...) : object 'GT' not found

kevinmneal commented 5 years ago

Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):

##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
1   locus_100029__100029__50    NA  C   T   .   PASS    NS=115  GT  
1   locus_100077__100077__49    NA  A   G   .   PASS    NS=174  GT  
1   locus_10009__10009__11  NA  C   T   .   PASS    NS=178  GT
1   locus_100137__100137__36    NA  G   T   .   PASS    NS=163  GT  
1   locus_100170__100170__22    NA  T   A   .   PASS    NS=102  GT  
1   locus_100203__100203__29    NA  C   T   .   PASS    NS=108  GT  

This is what the input vcf looks like:

##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
locus_100006    5   .   G   A   13  PASS    NS=120;DP=2002  GT:DP:CATG  
locus_100014    8   .   G   T   13  PASS    NS=251;DP=8028  GT:DP:CATG  
locus_100023    12  .   C   A   13  PASS    NS=95;DP=2406   GT:DP:CATG  
locus_100029    50  .   C   T   13  PASS    NS=173;DP=3987  GT:DP:CATG  
locus_100038    19  .   T   C   13  PASS    NS=261;DP=9508  GT:DP:CATG  
locus_100042    50  .   C   A   13  PASS    NS=91;DP=887    GT:DP:CATG  
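
In the output excerpt above, the second column (POS) appears to hold a marker name rather than an integer position, and the first column (CHROM) holds `1`. A quick base R sketch for spotting that kind of column shift, assuming a tab-delimited VCF with the standard 9 fixed columns before the samples. The helper name `check_vcf_columns`, the sample name `ind1` and the genotype values are made up for illustration:

```r
# Hypothetical helper: sanity-check the fixed columns of a VCF file.
check_vcf_columns <- function(path, n = 100L) {
  lines <- readLines(path)
  header <- grep("^#CHROM", lines, value = TRUE)
  if (length(header) != 1L) stop("Expected exactly one #CHROM header line.")
  col_names <- strsplit(sub("^#", "", header), "\t", fixed = TRUE)[[1]]

  # Keep the first n non-empty data lines (those not starting with '#')
  body <- lines[!grepl("^#", lines)]
  body <- utils::head(body[nzchar(body)], n)
  fields <- strsplit(body, "\t", fixed = TRUE)
  pos <- vapply(fields, function(x) x[2], character(1))

  list(
    n_samples      = max(length(col_names) - 9L, 0L),  # columns after FORMAT
    pos_is_integer = all(grepl("^[0-9]+$", pos)),      # POS must be numeric
    same_width     = all(lengths(fields) == length(col_names))
  )
}

# Two made-up one-marker files modelled on the excerpts above
vcf_header <- paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                      "FILTER", "INFO", "FORMAT", "ind1"), collapse = "\t")
good_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.0", vcf_header,
             paste(c("locus_100006", "5", ".", "G", "A", "13", "PASS",
                     "NS=120;DP=2002", "GT", "0/1"), collapse = "\t")),
           good_file)
shifted_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.3", vcf_header,
             paste(c("1", "locus_100029__100029__50", "NA", "C", "T", ".",
                     "PASS", "NS=115", "GT", "0/1"), collapse = "\t")),
           shifted_file)

good <- check_vcf_columns(good_file)
shifted <- check_vcf_columns(shifted_file)
```

Here `good$pos_is_integer` is `TRUE` and `shifted$pos_is_integer` is `FALSE`, which flags the second file for a closer look before passing it to other tools.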

thierrygosselin commented 5 years ago

I’ll have a look at it in 1 hour

kevinmneal commented 5 years ago

It's the weekend, no rush! (unless you want to)

Thanks

thierrygosselin commented 5 years ago

Working non-stop for 2 weeks. Workshop is next week, got to have radiator ready ;) https://thierrygosselin.github.io/genomics-workshops/

chasesmith15 commented 5 years ago

Hi Thierry, Any updates on the next version of radiator coming out? Running into some of the same problems stated above using filter_rad on DArT data:

Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1458\01_radiator\radiator_20190418@1458.gds.rd'. The process cannot access the file because it is being used by another process.

Also: In grid.Call(C_textBounds, as.graphicsAnnot(x$label), ... : font family not found in Windows font database

Thanks!

Chase

thierrygosselin commented 5 years ago

I’m sick and won’t be able to work or push something before Monday next week.

Some of your problems seem to be related to parallel processing. Try setting the parallel.core argument to 1. But it cannot access the file, which seems to be another problem: 1) make sure you have read/write permission 2) why is the file ending with ‘gds.rd’? It should be automatically set to ‘gds.rad’, weird...

thierrygosselin commented 5 years ago

The other one about font family is new to me. I removed all mention of font family so that Windows machines wouldn’t have to struggle with the Helvetica font; it should pick whatever you have as default.

Have you made ggplot figures before? Did you get similar errors?

chasesmith15 commented 5 years ago

Hey Thierry,

Thanks for the quick response! Hope you feel better soon.

-Setting to 1 core seemed to fix the errors, even the font family one. I've never had that error with ggplot either, so strange.

-Files are saving as .rad, not .rd; something must have gotten deleted when I was copying over.

I'll let you know if I run into anything else.

thierrygosselin commented 5 years ago

How many cores do you have on your PC?

chasesmith15 commented 5 years ago

8 cores on my PC

thierrygosselin commented 5 years ago

For some functions, the SeqArray package has some problems dealing with parallel processing on PC (because of the lack of forking, it uses something else). So for those I’m detecting the OS and setting it automatically.

If you encounter another problem let me know exactly when it happens in the pipeline
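
A rough sketch of that kind of OS check in base R, for anyone who wants the same safeguard in their own scripts. This is not radiator's actual code; the helper name `choose_cores` and the fall-back-to-1-core-on-Windows rule are assumptions for illustration:

```r
# Hypothetical helper: pick a number of cores, falling back to 1 on Windows,
# where forking is not available.
choose_cores <- function(requested = NULL, os_type = .Platform$OS.type) {
  if (identical(os_type, "windows")) {
    return(1L)
  }
  available <- parallel::detectCores()
  if (is.na(available)) available <- 1L          # detectCores() can return NA
  if (is.null(requested)) requested <- available - 1L
  # Never fewer than 1 core, never more than the machine reports
  as.integer(max(1L, min(requested, available)))
}

n_cores <- choose_cores(requested = 8)
```

The result can then be passed to the parallel.core argument, e.g. `parallel.core = n_cores`.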

thierrygosselin commented 5 years ago

8 cores shouldn’t be any problem

chasesmith15 commented 5 years ago

If I specify more than 1 core I get an error during the first common marker filter:

Scanning for common markers...
Generating UpSet plot to visualize markers in common
Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx, : One of the nodes produced an error: Can not open file 'C:\Users\Chase Smith\Desktop\Elliptio_All_Filter\filter_rad_20190418@1527\01_radiator\radiator_20190418@1527.gds.rad'. The process cannot access the file because it is being used by another process

If I specify 1 core I get an error at the MAC filter step:

Error in SeqArray::seqGetData(gds, "annotation/format/AD") : The GDS node "annotation/format/AD/data" does not exist.

thierrygosselin commented 5 years ago

New push on GitHub. Try again and let me know if it works or not.

chasesmith15 commented 5 years ago

That worked! Thanks for all the help.

thierrygosselin commented 5 years ago

Hi Kevin, I'm closing the issue, because I think everything was covered. If I missed something, let me know by reopening it.

kevinmneal commented 5 years ago

Hi Thierry,

Hadn't used filter_rad/genomic_converter in a while, but I reran filter_rad today with the latest version of radiator and still get this error that I brought up a month or so ago, at the genomic_converter step of filter_rad. Any ideas?

Transferring data to genomic converter...
Synchronizing data and strata...
    Number of strata: 1
    Number of individuals: 153

Writing tidy data set:
sphasouth153spatial.nohybrids.radiator.50pctmsng.oneSNPmac3.rad
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 10
Calibrating REF/ALT alleles...
    number of REF/ALT switch = 10
Error in .f(.x[[i]], ...) : object 'GT' not found

Looking at the output vcf file, it looks like there's an issue with parsing the original input vcf (which worked previously):

##fileformat=VCFv4.3
##fileDate=20190329
##source=radiator_v.1.0.0
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples W
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
1 locus_100029__100029__50    NA  C   T   .   PASS    NS=115  GT  
1 locus_100077__100077__49    NA  A   G   .   PASS    NS=174  GT  
1 locus_10009__10009__11  NA  C   T   .   PASS    NS=178  GT
1 locus_100137__100137__36    NA  G   T   .   PASS    NS=163  GT  
1 locus_100170__100170__22    NA  T   A   .   PASS    NS=102  GT  
1 locus_100203__100203__29    NA  C   T   .   PASS    NS=108  GT  

This is what the input vcf looks like:

##fileformat=VCFv4.0
##fileDate=2019/01/13
##source=ipyrad_v.0.7.28
##reference=Spea_genomeassembly_fromEvan.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With 
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
locus_100006  5   .   G   A   13  PASS    NS=120;DP=2002  GT:DP:CATG  
locus_100014  8   .   G   T   13  PASS    NS=251;DP=8028  GT:DP:CATG  
locus_100023  12  .   C   A   13  PASS    NS=95;DP=2406   GT:DP:CATG  
locus_100029  50  .   C   T   13  PASS    NS=173;DP=3987  GT:DP:CATG  
locus_100038  19  .   T   C   13  PASS    NS=261;DP=9508  GT:DP:CATG  
locus_100042  50  .   C   A   13  PASS    NS=91;DP=887    GT:DP:CATG