pinskylab / genomics

Wrangling of genomic data and identity analysis
3 stars 2 forks source link

fish in seq03-33 without gen_id #10

Closed katcatalano closed 5 years ago

katcatalano commented 5 years ago

Apologies if this is a repeat issue, I thought I opened this before my vacation, but apparently I didn't... I uploaded a RData file with a data frame of these in https://github.com/katcatalano/parentage/blob/master/nogenid_seq03-33.rds There may be something buggy with my code that is attaching meta-data. Maybe try attaching meta-data (sample_id, gen_id) to the ligation_ids in the seq03-33 genepop file with your workflow? If that works, then this is a bug in my code, but so far I haven't figured it out. Thanks!

katcatalano commented 5 years ago

If it helps, I'm attaching the gen file I'm trying to get the meta-data for her. It's formatted for Colony, and has a subset of 696 loci (I'm still working out the sweet spot of loci for parentage, and I'll let you know once I know for identity). I also just deleted the ".second ligation_id" from re-genotyped samples, figuring I just needed the first ligation_id to retrieve the gen_id. Raw data can be found here: https://raw.githubusercontent.com/katcatalano/parentage/master/colony2/20190320_697loci/input/20190321_696loci_nold_maf0.2-0.5.txt?token=AY8C-QnJSL--Y7me-UuksnkX5E8cUbRMks5ck8AdwA%3D%3D

The code I'm using is below:

"%!in%" <- function(x,table) match(x,table, nomatch = 0) == 0
issues <- as.data.frame(labor %>% tbl("known_issues") %>% select(ligation_id))

get_samp <- function (x) 
{
    lig <- labor %>% tbl("ligation") %>% collect() %>% filter(ligation_id %in% 
        x) %>% select(ligation_id, digest_id)
    dig <- labor %>% tbl("digest") %>% collect() %>% filter(digest_id %in% 
        lig$digest_id) %>% select(extraction_id, digest_id)
    extr <- labor %>% tbl("extraction") %>% collect() %>% filter(extraction_id %in% 
        dig$extraction_id) %>% select(extraction_id, sample_id)
    samp <- leyte %>% tbl("clownfish") %>% collect() %>% filter(sample_id %in% 
        extr$sample_id) %>% select(sample_id, gen_id)
    one <- left_join(lig, dig, by = "digest_id")
    two <- left_join(extr, one, by = "extraction_id") %>% select(sample_id, 
        ligation_id)
    three <- left_join(samp, two, by = "sample_id") %>% select(sample_id, 
        ligation_id, gen_id)
    return(three)
}

get_fish <- function () 
{
    fish <- leyte %>% tbl("clownfish") %>% select(fish_table_id, 
        anem_table_id, fish_spp, size, color, recap, fish_obs_time, 
        gen_id, sample_id) %>% collect()
    return(fish)
}

#load sequencing data
dat_gen <- 20190321_696loci_nold_maf0.2-0.5.txt #should have 2544 individuals

fish_in_gen_lig <- dat_gen %>% #pull of the ligation_ids of fish in dat_gen, and remove those in the issues table
    select(ligation_id) %>%
    filter(ligation_id %!in% issues$ligation_id) #54 fish in issues table, these are dropped
nrow(dat_gen) -nrow(fish_in_gen_lig) 

fish_in_gen_samp <- get_samp(fish_in_gen_lig$ligation_id) %>% #should be 2490 
    distinct(ligation_id, sample_id, .keep_all = T) #all of the ligation_ids in fish_in_gen_samp are in fish_in_gen_lig. Are the extra 7 sample_ids just multiples for regenotyped fish? YES see the following when this line is commented out: test %>% group_by(ligation_id, sample_id) %>% summarise(n=n()) %>% filter(n >1)
nrow(fish_in_gen_lig)- nrow(fish_in_gen_samp) #should be 0

fish_ids <- get_fish() %>% #get all identifying info for a sample
    filter(sample_id %in% fish_in_gen_samp$sample_id) %>%
    distinct(sample_id, .keep_all = T)
#who has no gen_id in this?
test <- (fish_ids %>%
    filter(is.na(gen_id))) #399 fish
#test$recap #interesting. These have NA or N values. So maybe they aren't regenotypes and can just be sorted, but let's see what Michelle says for now. Those with "NA" in the recap column may be related to issue #7, but this is pulled fresh from the database. 

fish_ids <- left_join(fish_in_gen_samp, fish_ids, by=c("sample_id", "gen_id"))#join the ligation_ids back in
fish_date <- get_date(fish_ids$sample_id)%>%# add the date to the sample_id
    distinct(sample_id, .keep_all = T) 
fish_meta <- left_join(fish_ids, fish_date, by="sample_id") #join with the rest of the data

##join with genetic data
dat_gen_meta <- left_join(fish_meta, dat_gen, by="ligation_id")

It looks like all the fish without gen_ids also have NA or N in the recap column. Maybe this is related to issue #7 ? Again, not urgent, just updating with more information I found.

mstuart1 commented 5 years ago

Was this not solved by #6 ? Also, I've added an urgent label to the label options so I know if I need to stop working on something else and address an issue immediately.

katcatalano commented 5 years ago

Okay, cool. It doesn't look to me like this is solved by #6, because I'm pulling the data from the database. Maybe I'm missing something?

mstuart1 commented 5 years ago

I see now, based on the analysis done in #6, all of the samples missing gen_ids were from earlier years. This new Rdata file you've provided has 2017/2018 fish missing gen_id. I will re-do the analysis and see what is going on.

katcatalano commented 5 years ago

I pulled from the database again, and there are 172 samples without gen_ids now (in https://github.com/katcatalano/parentage/blob/master/172_nogenid.rds, 172_nogenid.rds). Not sure if you're still working on this because the issue is now closed.

mstuart1 commented 5 years ago

Still finding fish in seq03-33 without a gen_id. I think this is because we refiltered and included more individuals. These changes have to be made in the leyteBuildDB > db_corrections.Rmd script because they have to be reproduced if the database is restored from an older backup. The plan is to: 1) assign all fish in the filtered genepop with a gen_id. done 2) run regenotype analysis to make sure any regenotyped fish are assigned the same gen_id. done 3) run identity analysis to make sure any recaptured fish are assigned the same gen_id. done