Closed Anthony0312 closed 4 years ago
Dear Anthony, you said earlier that this was DArT data, but here it shows a VCF file. Is that expected? Either way, this is not normal behaviour, and it is probably caused by the dataset or the strata file. Send me those files and I'll have a look.
Best Thierry
Dear Thierry,
It is DArT data. I used a .vcf, .csv, or .txt file, just as I did with the previous version of the package.
I'm sending the files for you to take a look at. Thank you.
Please don't use a file ending in .vcf if in reality it's a DArT file. That just confuses the software; there's nothing magical about detecting the file type.
There is a format specification for VCF files: respect the standard.
This looks like a modified DArT file, not something DArT would send. Do you have the original .csv?
Hi Thierry,
I used the original DArT file and modified a few things (I need to delete the first lines of the file, edit the sample names, and delete samples that will not be used in the filtering). I turned it into a .vcf file because that had worked in the previous version; I had also used a .txt file.
I have also used the original DArT .csv file and it didn't work. It stopped at step 5 and did not generate any figures.
I'm sending you the original file so you can take a look
Report_DFr17-2822_SNP_1.csv.zip https://drive.google.com/file/d/1rbNr7QS6CJduM0Vk3hud4CHC3oRr6iVa/view?usp=drive_web
On Thursday, 19/03/2020 at 13:04, Thierry Gosselin wrote:
- vcf vs DArT
Please don't use a file ending in .vcf if in reality it's a DArT file. That just confuses the software; there's nothing magical about detecting the file type.
There is a format specification for VCF files: respect the standard.
- This looks like a modified DArT file, not something DArT would send. Do you have the original .csv?
Hi Thierry,
I am sending you the original file by email so you can have a look, because even when compressed it is larger than 10 MB.
Hi Anthony,
Below are some suggestions and recommendations.
1. VCF
Again, keep the .vcf extension for real VCF files: https://samtools.github.io/hts-specs/VCFv4.3.pdf
There is something wrong with the formatting of the .vcf file you sent. The AvgCountRef and AvgCountSnp columns are just one example. Also, the names have spaces at the end, before the TAB (radiator takes care of this, but other software might not).
2. DArT files
It's better not to modify them. If you do, do it properly, in a text editor that lets you see invisible characters; otherwise you will break the file.
The DArT file you sent is not the original one: in the first file you sent, the genotypes were on 1 row; in the second file, your original ending with .csv, the genotypes are on 2 rows and there are no sample names. Verify that it's really the same data.
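One way to catch the stray spaces mentioned above before they cause trouble is to scan the first lines of a delimited file for fields with leading or trailing whitespace. This is a minimal base R sketch; `find_padded_fields` and the demo file are hypothetical, and it only looks at ordinary whitespace, not every kind of invisible character.

```r
# Hypothetical helper: report fields with leading or trailing whitespace
# in the first `n_lines` lines of a delimited text file.
find_padded_fields <- function(path, sep = "\t", n_lines = 10L) {
  lines <- readLines(path, n = n_lines, warn = FALSE)
  hits <- lapply(seq_along(lines), function(i) {
    fields <- strsplit(lines[i], sep, fixed = TRUE)[[1]]
    bad <- which(grepl("^\\s+|\\s+$", fields))
    if (length(bad) == 0L) return(NULL)
    data.frame(line = i, field = bad, value = fields[bad], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits) # NULL when nothing suspicious is found
}

# Demo: the second header field has a trailing space before the TAB
demo_file <- tempfile(fileext = ".tsv")
writeLines(c("ID\tind_1 \tind_2", "snp_1\t0\t1"), demo_file)
padded <- find_padded_fields(demo_file)
```

`padded` holds one row per offending field, with its line number, field position, and raw value, so the padding can be fixed in the source file rather than silently trimmed.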
3. DArT and names
In the .csv you sent, the names seem to be on line 6 instead of line 7. Here is what I did to make it work:
library(magrittr) # provides the %>% pipe used below
dart.prob <- readr::read_csv(
file = "Report_DFr17-2822_SNP_1.csv",
col_names = FALSE,
n_max = 7 # the number of lines to work with
) %>%
t %>%
tibble::as_tibble(.)
lines.with.star <- length(
which(
stringi::stri_detect_fixed(str = dart.prob$V1, pattern = "*")
)
)
# Here I use the 6th column, which holds the ids, and remove the species annotation
dart.top.col <- dplyr::bind_rows(
dplyr::filter(dart.prob, dplyr::row_number() <= lines.with.star),
dplyr::filter(dart.prob, dplyr::row_number() > lines.with.star) %>%
dplyr::mutate(
V7 = stringi::stri_replace_all_fixed(
str = V6,
pattern = " [Original species: Allobates femoralis]",
replacement = "",
vectorize_all = FALSE)
)
)
# Write to the working directory if you need to work on it in a text editor or Excel
# (readr >= 1.4.0 names this argument `file` instead of `path`):
readr::write_tsv(x = dart.top.col, path = "dart.top.col.tsv", col_names = FALSE)
# read back in R
dart.top.col <- readr::read_tsv(file = "dart.top.col.tsv", col_names = FALSE)
# transpose, then merge with the rest of the DArT file by writing the new file
readr::write_csv(
x = dart.top.col %>% t %>% tibble::as_tibble(.),
path = "Report_DFr17-2822_SNP_1_mod.csv",
col_names = FALSE
)
readr::read_csv(file = "Report_DFr17-2822_SNP_1.csv", col_names = FALSE, skip = 7) %>%
readr::write_csv(
x = .,
path = "Report_DFr17-2822_SNP_1_mod.csv",
col_names = FALSE,
append = TRUE
)
# the warnings are expected
Check the ids and generate the strata file:
id <- radiator::extract_dart_target_id(data = "Report_DFr17-2822_SNP_1_mod.csv")
# Then work on the automatically generated file to build the strata file.
# Add the columns INDIVIDUALS and STRATA. Name the samples in the INDIVIDUALS column as you wish.
# The TARGET_ID column remains the same, as it will be used with the DArT file.
# Only the samples in the strata file will be used, so if you remove lines from it,
# those samples are blacklisted from the start. You can test it with the strata file
# you sent me, strata_zona.tsv: it only has 14 samples, and the DArT file has 268 samples...
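The strata-file steps described in the comments above could look roughly like this in base R. The ids, sample names, and population labels are made up, the data frame stands in for the file that the extraction step writes, and the column order is an assumption.

```r
# Made-up TARGET_ID values standing in for the ids extracted from the DArT file
ids <- data.frame(TARGET_ID = c("A01", "A02", "B01", "B02"), stringsAsFactors = FALSE)

# Add the two columns described above; TARGET_ID itself is left untouched
strata <- ids
strata$INDIVIDUALS <- paste0("ind_", seq_len(nrow(strata))) # names of your choosing
strata$STRATA <- c("POP1", "POP1", "POP2", "POP2")          # hypothetical populations
strata <- strata[, c("INDIVIDUALS", "STRATA", "TARGET_ID")] # column order is an assumption

# Removing a line leaves that sample out of the strata file from the start
strata <- strata[strata$TARGET_ID != "B02", ]

# Write a tab-separated strata file
strata_file <- file.path(tempdir(), "strata_demo.tsv")
write.table(strata, strata_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Sample counts per stratum, useful for spotting very small groups
samples_per_stratum <- table(strata$STRATA)
```

Checking `samples_per_stratum` before filtering shows how many samples each stratum keeps once lines have been removed.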
4. filtering
test1 <- radiator::read_dart(
data = "Report_DFr17-2822_SNP_1_mod.csv",
strata = "strata_zona.tsv"
)
# reads fine
test2 <- radiator::filter_rad(
data = "Report_DFr17-2822_SNP_1_mod.csv",
strata = "strata_zona.tsv"
)
# works
Re-install radiator; I pushed some changes today.
Hi Thierry,
After a few weeks of studying and trying to filter my data, I still haven't managed to finish. This is really a problem for me. The filtering does not complete; it stops at "13_filter_hwe" with the following message:
################################################
Execution date@time: 20200420@1730
Interactive mode: on
Function call and arguments stored in: radiator_filter_hwe_args_20200420@1730.tsv
using tidy data frame of genotypes as input
skipping all filters
Filters parameters file: initiated
Strata removed from analysis because n < 10: BALBINA, DUCKE
Note: removed strata are included back in datasets at the end
Summarizing data
File written: genotypes.summary.tsv
HWE analysis for pop: OVERALL
| | 0%, ETA NA
Error in X[i, 2] : subscript out of bounds
In addition: Warning messages:
1: Removed 15 row(s) containing missing values (geom_path).
2: Removed 30 rows containing missing values (geom_point).
3: Removed 15 row(s) containing missing values (geom_path).
4: Removed 30 rows containing missing values (geom_point).
5: NAs introduced by coercion
#############################################
I don't know what may be going on. I'm basically filtering using outlier statistics. If you could look at my data again I would be very grateful. P.S. If you look at my strata file, you will see that I selected only the few individuals I'm interested in, not the entire database.
Thanks in advance.
Hi Thierry,
Below I describe the code used and the filtering output. Note that even with the current filter_rad command, it asks me to read the documentation because the function argument names have changed. It is not generating figures anywhere, even in interactive mode (it moves on to the next step). The filtering ends in step 5 (filter_individuals) with the error: Error in round(depth$AVG_COUNT_REF, 0) : non-numeric argument to mathematical function.
My code:
filter_rad(data = "snp_zona.vcf", strata = "strata_zona.tsv", interactive.filter = TRUE, output = NULL, filename = NULL, verbose = TRUE, parallel.core = parallel::detectCores() - 1)
Filtering start and output:
################################################################################
############################# radiator::filter_rad #############################
################################################################################
The function arguments names have changed: please read documentation
Execution date@time: 20200318@1039
Folder created: filter_rad_20200318@1039
Function call and arguments stored in: radiator_filter_rad_args_20200318@1039.tsv
File written: random.seed (969052)
Filters parameters file generated: filters_parameters_20200318@1039.tsv
Reading DArT file...
Number of blacklisted samples: 3
DArT SNP format: genotypes in 1 Row
Generating genotypes and calibrating REF/ALT alleles...
Number of markers recalibrated based on counts of allele: 40383
Generating GDS...
File written: radiator_20200318@1039.gds.rad
Number of chrom: 1
Number of locus: 83889
Number of SNPs: 147595
Number of populations: 4
Number of individuals: 14
Number of ind/pop: FUFAM = 5 PFIG = 2 BALB = 4 RMOR = 3
Number of duplicate id: 0
Computation time, overall: 14 sec
Filters parameters file: initiated
################################################################################
##################### radiator::filter_dart_reproducibility ####################
################################################################################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_dart_reproducibility_args_20200318@1039.tsv
Interactive mode: on
2 steps to visualize and filter the data based on reproducibility:
Step 1. Visualization
Step 2. Choose the filtering threshold
Filters parameters file: initiated
This filter requires REP_AVG info, skipping filtering...
Computation time, overall: 1 sec
#################### completed filter_dart_reproducibility #####################
################################################################################
######################### radiator::filter_monomorphic #########################
################################################################################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_monomorphic_args_20200318@1039.tsv
Filters parameters file: initiated
File written: blacklist.monomorphic.markers_20200318@1039.tsv
Synchronizing markers.meta
File written: whitelist.polymorphic.markers_20200318@1039.tsv
Filters parameters file: updated
################################### RESULTS ####################################
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
Before: 14 / 4 / 1 / 83889 / 147595
Blacklisted: 0 / 0 / 0 / 44561 / 96638
After: 14 / 4 / 1 / 39328 / 50957
Computation time, overall: 8 sec
######################### completed filter_monomorphic #########################
################################################################################
######################## radiator::filter_common_markers #######################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_common_markers_args_20200318@1039.tsv
Filters parameters file: initiated
Scanning for common markers...
Generating UpSet plot to visualize markers in common
File written: blacklist.not.common.markers_20200318@1039.tsv
File written: whitelist.common.markers_20200318@1039.tsv
Filters parameters file: updated
################################### RESULTS ####################################
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
Before: 14 / 4 / 1 / 39328 / 50957
Blacklisted: 0 / 0 / 0 / 31451 / 42726
After: 14 / 4 / 1 / 7877 / 8231
Computation time, overall: 5 sec
####################### completed filter_common_markers ########################
################################################################################
######################### radiator::filter_individuals #########################
Execution date@time: 20200318@1039
Function call and arguments stored in: radiator_filter_individuals_args_20200318@1039.tsv
Interactive mode: on
Step 1. Visualization
Step 2. Missingness
Step 3. Heterozygosity
Step 4. Total Coverage (if available)
Filters parameters file: initiated
Step 1. Visualization of samples QC
Error in round(depth$AVG_COUNT_REF, 0) : non-numeric argument to mathematical function
In addition: Warning messages:
1: Unknown columns: CHROMPOS-NANORANA-PARKERI-V2X, ALNCNT-NANORANA-PARKERI-V2X, ALNEVALUE-NANORANA-PARKERI-V2X
2: Unknown columns: AVG_COUNT_REF, AVG_COUNT_SNP, REP_AVG
3: Unknown or uninitialised column: 'AVG_COUNT_REF'.
4: Unknown or uninitialised column: 'AVG_COUNT_SNP'.
5: Unknown or uninitialised column: 'AVG_COUNT_REF'.
Computation time, overall: 0 sec
######################### completed filter_individuals #########################
############################# completed filter_rad #############################
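The error and the "Unknown or uninitialised column" warnings in this log are consistent with general R behaviour: asking a table for a column that does not exist returns NULL, and `round()` cannot work on NULL. The sketch below reproduces that in base R with a made-up table. It illustrates the R behaviour only and is not a diagnosis of radiator's internals; `depth` here is a stand-in, not the package's object.

```r
# Made-up table standing in for marker metadata that lacks the AVG_COUNT_REF column
depth <- data.frame(MARKERS = c("m1", "m2"), stringsAsFactors = FALSE)

# Accessing a column that does not exist returns NULL
missing_col <- depth$AVG_COUNT_REF

# round() on NULL fails; capture the message instead of stopping
err <- tryCatch(
  round(missing_col, 0),
  error = function(e) conditionMessage(e)
)

# A simple pre-check: which of the expected columns are absent?
required <- c("AVG_COUNT_REF", "AVG_COUNT_SNP", "REP_AVG")
absent <- setdiff(required, names(depth))
```

In an English locale `err` holds the same "non-numeric argument to mathematical function" message as in the log, and `absent` lists the three columns that the warnings report as unknown. A check like this on the input's column names, before filtering, would flag the problem earlier.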