Closed katcatalano closed 5 years ago
If it helps, I'm attaching the gen file I'm trying to get the meta-data for her. It's formatted for Colony, and has a subset of 696 loci (I'm still working out the sweet spot of loci for parentage, and I'll let you know once I know for identity). I also just deleted the ".second ligation_id" from re-genotyped samples, figuring I just needed the first ligation_id to retrieve the gen_id. Raw data can be found here: https://raw.githubusercontent.com/katcatalano/parentage/master/colony2/20190320_697loci/input/20190321_696loci_nold_maf0.2-0.5.txt?token=AY8C-QnJSL--Y7me-UuksnkX5E8cUbRMks5ck8AdwA%3D%3D
The code I'm using is below:
"%!in%" <- function(x,table) match(x,table, nomatch = 0) == 0
issues <- as.data.frame(labor %>% tbl("known_issues") %>% select(ligation_id))
get_samp <- function (x)
{
lig <- labor %>% tbl("ligation") %>% collect() %>% filter(ligation_id %in%
x) %>% select(ligation_id, digest_id)
dig <- labor %>% tbl("digest") %>% collect() %>% filter(digest_id %in%
lig$digest_id) %>% select(extraction_id, digest_id)
extr <- labor %>% tbl("extraction") %>% collect() %>% filter(extraction_id %in%
dig$extraction_id) %>% select(extraction_id, sample_id)
samp <- leyte %>% tbl("clownfish") %>% collect() %>% filter(sample_id %in%
extr$sample_id) %>% select(sample_id, gen_id)
one <- left_join(lig, dig, by = "digest_id")
two <- left_join(extr, one, by = "extraction_id") %>% select(sample_id,
ligation_id)
three <- left_join(samp, two, by = "sample_id") %>% select(sample_id,
ligation_id, gen_id)
return(three)
}
get_fish <- function ()
{
fish <- leyte %>% tbl("clownfish") %>% select(fish_table_id,
anem_table_id, fish_spp, size, color, recap, fish_obs_time,
gen_id, sample_id) %>% collect()
return(fish)
}
#load sequencing data
dat_gen <- 20190321_696loci_nold_maf0.2-0.5.txt #should have 2544 individuals
fish_in_gen_lig <- dat_gen %>% #pull of the ligation_ids of fish in dat_gen, and remove those in the issues table
select(ligation_id) %>%
filter(ligation_id %!in% issues$ligation_id) #54 fish in issues table, these are dropped
nrow(dat_gen) -nrow(fish_in_gen_lig)
fish_in_gen_samp <- get_samp(fish_in_gen_lig$ligation_id) %>% #should be 2490
distinct(ligation_id, sample_id, .keep_all = T) #all of the ligation_ids in fish_in_gen_samp are in fish_in_gen_lig. Are the extra 7 sample_ids just multiples for regenotyped fish? YES see the following when this line is commented out: test %>% group_by(ligation_id, sample_id) %>% summarise(n=n()) %>% filter(n >1)
nrow(fish_in_gen_lig)- nrow(fish_in_gen_samp) #should be 0
fish_ids <- get_fish() %>% #get all identifying info for a sample
filter(sample_id %in% fish_in_gen_samp$sample_id) %>%
distinct(sample_id, .keep_all = T)
#who has no gen_id in this?
test <- (fish_ids %>%
filter(is.na(gen_id))) #399 fish
#test$recap #interesting. These have NA or N values. So maybe they aren't regenotypes and can just be sorted, but let's see what Michelle says for now. Those with "NA" in the recap column may be related to issue #7, but this is pulled fresh from the database.
fish_ids <- left_join(fish_in_gen_samp, fish_ids, by=c("sample_id", "gen_id"))#join the ligation_ids back in
fish_date <- get_date(fish_ids$sample_id)%>%# add the date to the sample_id
distinct(sample_id, .keep_all = T)
fish_meta <- left_join(fish_ids, fish_date, by="sample_id") #join with the rest of the data
##join with genetic data
dat_gen_meta <- left_join(fish_meta, dat_gen, by="ligation_id")
It looks like all the fish without gen_ids also have NA or N in the recap column. Maybe this is related to issue #7 ? Again, not urgent, just updating with more information I found.
Was this not solved by #6 ? Also, I've added an urgent label to the label options so I know if I need to stop working on something else and address an issue immediately.
Okay, cool. It doesn't look to me like this is solved by #6, because I'm pulling the data from the database. Maybe I'm missing something?
I see now, based on the analysis done in #6, all of the samples missing gen_ids were from earlier years. This new Rdata file you've provided has 2017/2018 fish missing gen_id. I will re-do the analysis and see what is going on.
I pulled from the database again, and there are 172 samples without gen_ids now (in https://github.com/katcatalano/parentage/blob/master/172_nogenid.rds, 172_nogenid.rds). Not sure if you're still working on this because the issue is now closed.
Still finding fish in seq03-33 without a gen_id. I think this is because we refiltered and included more individuals. These changes have to be made in the leyteBuildDB > db_corrections.Rmd script because they have to be reproduced if the database is restored from an older backup. The plan is to: 1) assign all fish in the filtered genepop with a gen_id. done 2) run regenotype analysis to make sure any regenotyped fish are assigned the same gen_id. done 3) run identity analysis to make sure any recaptured fish are assigned the same gen_id. done
Apologies if this is a repeat issue, I thought I opened this before my vacation, but apparently I didn't... I uploaded a RData file with a data frame of these in https://github.com/katcatalano/parentage/blob/master/nogenid_seq03-33.rds There may be something buggy with my code that is attaching meta-data. Maybe try attaching meta-data (sample_id, gen_id) to the ligation_ids in the seq03-33 genepop file with your workflow? If that works, then this is a bug in my code, but so far I haven't figured it out. Thanks!