thierrygosselin / radiator

RADseq Data Exploration, Manipulation and Visualization using R
https://thierrygosselin.github.io/radiator/
GNU General Public License v3.0

Filtering failed #81

Closed Anthony0312 closed 4 years ago

Anthony0312 commented 4 years ago

Hi Thierry,

Below I describe the code I used and the filtering output. Note that even with the current filter_rad command, it asks me to read the documentation because the function argument names have changed. It does not generate the figures anywhere, even in interactive mode (it moves on to the next step). The filtering ends at step 5 (filter_individuals) with the error: Error in round(depth$AVG_COUNT_REF, 0) : non-numeric argument to mathematical function.

My code:

filter_rad(
  data = "snp_zona.vcf",
  strata = "strata_zona.tsv",
  interactive.filter = TRUE,
  output = NULL,
  filename = NULL,
  verbose = TRUE,
  parallel.core = parallel::detectCores() - 1
)

Filtering start and output:

################################################################################
############################# radiator::filter_rad #############################
################################################################################
The function arguments names have changed: please read documentation

Execution date@time: 20200318@1039
Folder created: filter_rad_20200318@1039
Function call and arguments stored in: radiator_filter_rad_args_20200318@1039.tsv
File written: random.seed (969052)
Filters parameters file generated: filters_parameters_20200318@1039.tsv
Reading DArT file...
Number of blacklisted samples: 3
DArT SNP format: genotypes in 1 Row
Generating genotypes and calibrating REF/ALT alleles...
Number of markers recalibrated based on counts of allele: 40383
Generating GDS...
File written: radiator_20200318@1039.gds.rad

Number of chrom: 1
Number of locus: 83889
Number of SNPs: 147595
Number of populations: 4
Number of individuals: 14

Number of ind/pop:
FUFAM = 5
PFIG = 2
BALB = 4
RMOR = 3

Number of duplicate id: 0

Computation time, overall: 14 sec
Filters parameters file: initiated
################################################################################
#################### radiator::filter_dart_reproducibility #####################
################################################################################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_dart_reproducibility_args_20200318@1039.tsv

Interactive mode: on
2 steps to visualize and filter the data based on reproducibility:
Step 1. Visualization
Step 2. Choose the filtering threshold

Filters parameters file: initiated
This filter requires REP_AVG info, skipping filtering...

Computation time, overall: 1 sec
#################### completed filter_dart_reproducibility #####################
################################################################################
######################### radiator::filter_monomorphic #########################
################################################################################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_monomorphic_args_20200318@1039.tsv
Filters parameters file: initiated
File written: blacklist.monomorphic.markers_20200318@1039.tsv
Synchronizing markers.meta
File written: whitelist.polymorphic.markers_20200318@1039.tsv
Filters parameters file: updated
################################### RESULTS ####################################

Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Before: 14 / 4 / 1 / 83889 / 147595
Blacklisted: 0 / 0 / 0 / 44561 / 96638
After: 14 / 4 / 1 / 39328 / 50957

Computation time, overall: 8 sec
######################### completed filter_monomorphic #########################
################################################################################
####################### radiator::filter_common_markers ########################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_common_markers_args_20200318@1039.tsv
Filters parameters file: initiated
Scanning for common markers...
Generating UpSet plot to visualize markers in common
File written: blacklist.not.common.markers_20200318@1039.tsv
File written: whitelist.common.markers_20200318@1039.tsv
Filters parameters file: updated
################################### RESULTS ####################################

Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 14 / 4 / 1 / 39328 / 50957
Blacklisted: 0 / 0 / 0 / 31451 / 42726
After: 14 / 4 / 1 / 7877 / 8231

Computation time, overall: 5 sec
####################### completed filter_common_markers ########################
################################################################################
######################### radiator::filter_individuals #########################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_individuals_args_20200318@1039.tsv
Interactive mode: on

Step 1. Visualization
Step 2. Missingness
Step 3. Heterozygosity
Step 4. Total Coverage (if available)

Filters parameters file: initiated

Step 1. Visualization of samples QC

Error in round(depth$AVG_COUNT_REF, 0) :
  non-numeric argument to mathematical function
In addition: Warning messages:
1: Unknown columns: CHROMPOS-NANORANA-PARKERI-V2X, ALNCNT-NANORANA-PARKERI-V2X, ALNEVALUE-NANORANA-PARKERI-V2X
2: Unknown columns: AVG_COUNT_REF, AVG_COUNT_SNP, REP_AVG
3: Unknown or uninitialised column: 'AVG_COUNT_REF'.
4: Unknown or uninitialised column: 'AVG_COUNT_SNP'.
5: Unknown or uninitialised column: 'AVG_COUNT_REF'.

Computation time, overall: 0 sec
######################### completed filter_individuals #########################
############################# completed filter_rad #############################

thierrygosselin commented 4 years ago

Dear Anthony, you said earlier that it was DArT data, but here the input is a .vcf file. Is that expected? Anyway, this is not normal behaviour, and it's probably because of the dataset/strata. Send me those files and I'll have a look.

Best Thierry

Anthony0312 commented 4 years ago

Dear Thierry,

It is DArT data. I have used .vcf, .csv and .txt files, just as I did with the previous version of the package.

I'm sending the files for you to take a look at. Thank you.

Arquivo Comprimido.zip

thierrygosselin commented 4 years ago
  1. vcf vs DArT

Please don't use a file name ending in .vcf if the file is really a DArT file... that just confuses the software; there's nothing magical about detecting file types.

There is a format specification for VCF files... respect the standard.

  2. This looks like a modified DArT file, not something DArT would send... Do you have the original .csv?

Anthony0312 commented 4 years ago

Hi Thierry,

I used the original DArT file and modified a few things (I needed to delete the first lines of the file, edit the sample names and delete samples that will not be used in the filtering...). I turned it into a .vcf file because that had worked in the previous version... I had also used a .txt file.

I have also used the original DArT .csv file and it didn't work. It stopped at step 5 and did not generate any figures.

I'm sending you the original file so you can take a look

Report_DFr17-2822_SNP_1.csv.zip https://drive.google.com/file/d/1rbNr7QS6CJduM0Vk3hud4CHC3oRr6iVa/view?usp=drive_web


Anthony0312 commented 4 years ago

Hi Thierry,

I used the original DArT file and modified a few things (I needed to delete the first lines of the file, edit the sample names and delete samples that will not be used in the filtering...). I turned it into a .vcf file because that had worked in the previous version... I had also used a .txt file.

I have also used the original DArT .csv file and it didn't work. It stopped at step 5 and did not generate any figures.

I am sending you the original file by email so you can have a look because, even when compressed, it is larger than 10M.

thierrygosselin commented 4 years ago

Hi Anthony,

Below are some suggestions/recommendations.

1. VCF

Again, keep the .vcf extension for real VCF files: https://samtools.github.io/hts-specs/VCFv4.3.pdf

There is something wrong with the formatting of the .vcf file you sent. The AvgCountRef and AvgCountSnp columns are just one example. Also, the names have spaces at the end, before the TAB (radiator takes care of this, but other software might not).
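A quick way to tell a real VCF from a renamed DArT file is to look at the first line: the VCF specification linked above requires it to be a `##fileformat=VCFv4.x` header. The sketch below is only an illustration in base R. The helper name `looks_like_vcf` and the two temporary files are made up for the example and are not part of radiator.

```r
# Hypothetical helper (not part of radiator): TRUE when the first line of a
# file is the mandatory VCF header line, e.g. "##fileformat=VCFv4.3".
looks_like_vcf <- function(path) {
  first_line <- readLines(path, n = 1L, warn = FALSE)
  length(first_line) == 1L && startsWith(first_line, "##fileformat=VCF")
}

# Self-contained demonstration with two made-up temporary files.
real_vcf <- tempfile(fileext = ".vcf")
writeLines(
  c("##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"),
  real_vcf
)

# A comma-separated table that was merely given a .vcf extension.
renamed_csv <- tempfile(fileext = ".vcf")
writeLines(c("*,*,*,sample_1", "AlleleID,SNP,CallRate,sample_1"), renamed_csv)

looks_like_vcf(real_vcf)     # TRUE
looks_like_vcf(renamed_csv)  # FALSE
```

A check like this only inspects the header line; it says nothing about whether the rest of the file is well formed.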

2. DArT files

3. DArT and names ....

library(magrittr) # provides the %>% pipe used below

dart.prob <- readr::read_csv(
  file = "Report_DFr17-2822_SNP_1.csv",
  col_names = FALSE,
  n_max = 7 # the number of lines to work with
  ) %>%
  t %>%
  tibble::as_tibble(.)

lines.with.star <- length(
  which(
    stringi::stri_detect_fixed(str = dart.prob$V1, pattern = "*")
  )
)

# Here I use the 6th column with ids and remove the species
dart.top.col <- dplyr::bind_rows(
  dplyr::filter(dart.prob, dplyr::row_number() <= lines.with.star),
  dplyr::filter(dart.prob, dplyr::row_number() > lines.with.star) %>%
    dplyr::mutate(
      V7 = stringi::stri_replace_all_fixed(
        str = V6,
        pattern = " [Original species: Allobates femoralis]",
        replacement = "",
        vectorize_all = FALSE)
    )
)
# to write in the working directory if you need to work on it in a text editor or Excel:
readr::write_tsv(x = dart.top.col, path = "dart.top.col.tsv", col_names = FALSE)

# read back in R
dart.top.col <- readr::read_tsv(file = "dart.top.col.tsv", col_names = FALSE)

# transpose, merge with the rest of the DArT file by writing the new file
readr::write_csv(
  x = dart.top.col %>% t %>% tibble::as_tibble(.),
  path = "Report_DFr17-2822_SNP_1_mod.csv",
  col_names = FALSE
  )
readr::read_csv(file = "Report_DFr17-2822_SNP_1.csv", col_names = FALSE, skip = 7) %>%
  readr::write_csv(
    x = .,
    path = "Report_DFr17-2822_SNP_1_mod.csv",
    col_names = FALSE,
    append = TRUE
  )
# the warnings are expected

Check the ids and generate the strata file:

id <- radiator::extract_dart_target_id(data = "Report_DFr17-2822_SNP_1_mod.csv")
# Then work on the file generated automatically to generate the strata file.
# Add the columns INDIVIDUALS and STRATA. Rename the INDIVIDUALS column values as you wish.
# The TARGET_ID column remains the same, as it will be used with the DArT file.
# Only the samples in the strata file will be used, so if you remove lines from it,
# those samples are blacklisted from the start. You can test it with the strata file
# you sent me: strata_zona.tsv. It only has 14 samples, and the DArT file has 268 samples...
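Before filtering, it can help to see which samples a strata file will keep and how many fall in each stratum. The following is a minimal base R sketch on made-up data. The column names TARGET_ID, INDIVIDUALS and STRATA come from the comments above, while the ids, individual names and strata are invented for illustration and do not come from the real files.

```r
# Made-up target ids standing in for those extracted from the DArT file.
dart_ids <- c("1001", "1002", "1003", "1004", "1005")

# Made-up strata table with the three columns described above.
strata <- data.frame(
  TARGET_ID   = c("1001", "1002", "1004"),
  INDIVIDUALS = c("ind_A", "ind_B", "ind_C"),
  STRATA      = c("POP1", "POP1", "POP2"),
  stringsAsFactors = FALSE
)

# Samples present in the DArT file but absent from the strata file:
# these are the ones that would be blacklisted from the start.
dropped <- setdiff(dart_ids, strata$TARGET_ID)

# Strata ids with no match in the DArT file (e.g. typos or stray spaces).
unmatched <- setdiff(strata$TARGET_ID, dart_ids)

# Number of retained samples per stratum.
n_per_stratum <- table(strata$STRATA)

dropped        # "1003" "1005"
unmatched      # character(0)
n_per_stratum  # POP1: 2, POP2: 1
```

With real data, `dart_ids` would come from the TARGET_ID column of the file written by `radiator::extract_dart_target_id()`, and `strata` from reading the strata .tsv.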

4. filtering

test1 <- radiator::read_dart(
  data = "Report_DFr17-2822_SNP_1_mod.csv",
  strata = "strata_zona.tsv"
)
# reads fine
test2 <- radiator::filter_rad(
  data = "Report_DFr17-2822_SNP_1_mod.csv",
  strata = "strata_zona.tsv"
)
# works

thierrygosselin commented 4 years ago

Re-install radiator; I've pushed some changes today.

Anthony0312 commented 4 years ago

Hi Thierry,

After a few weeks of studying and trying to filter my data, I still haven't been able to finish. This is really a problem for me. The filtering stops at "13_filter_hwe" with the following message:

################################################
Execution date@time: 20200420@1730
Interactive mode: on
Function call and arguments stored in: radiator_filter_hwe_args_20200420@1730.tsv
    using tidy data frame of genotypes as input
    skipping all filters
Filters parameters file: initiated

Strata removed from analysis because n <10: BALBINA, DUCKE
    Note: removed strata are included back in datasets at the end

Summarizing data
File written: genotypes.summary.tsv
HWE analysis for pop: OVERALL
  | | 0%, ETA NA
Error in X[i, 2] : subscript out of bounds
In addition: Warning messages:
1: Removed 15 row(s) containing missing values (geom_path).
2: Removed 30 rows containing missing values (geom_point).
3: Removed 15 row(s) containing missing values (geom_path).
4: Removed 30 rows containing missing values (geom_point).
5: NAs introduced by coercion

#############################################

I don't know what may be going on. I'm basically filtering using outlier statistics. If you could look at my data again, I would be very grateful. P.S. If you look at my strata, I selected only a few individuals (not the entire database), the ones I'm interested in.

Thanks in advance.

data.zip