Identity Analysis parameters for Cervus

mstuart1 commented 5 years ago

@katcatalano @mpinsky

Since 2015, we have been using the identity analysis parameters of:

Minimum matching loci = 80% of total number of loci
Fuzzy matching mismatch = 10% of total number of loci

For the most recent run (seqs 03-33, 1005 loci), this translated into 804 minimum matching loci and no more than 101 mismatching to be considered the same individual.

I have 4 fish, A, B, C, and D.

Fish A matches to B and C. Fish D matches to B and C. Fish B and C match to each other. But Fish A and D did not result in an identity match.

Looking at the full comparison, out of 857 typed loci in fish A, 753 matched fish D and only 83 were mismatched.

Because this was 51 loci shy of the cutoff, these were not considered a match.
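The arithmetic above can be checked with a short sketch. It assumes that percentages are converted to locus counts by rounding up, which reproduces the 804 and 101 quoted above; the function and variable names are illustrative and are not part of Cervus.

```r
# Convert a percentage of loci into a count, rounding up (an assumption).
# Integer percentages are used to avoid floating-point surprises.
threshold_count <- function(n_loci, percent) {
  ceiling(n_loci * percent / 100)
}

n_loci <- 1005
min_match_80 <- threshold_count(n_loci, 80)  # minimum matching loci at 80%
max_mismatch <- threshold_count(n_loci, 10)  # maximum mismatching loci at 10%
min_match_75 <- threshold_count(n_loci, 75)  # the proposed 75% alternative

# Fish A vs. fish D, using the counts quoted above
ad_match <- 753
ad_mismatch <- 83

# A pair passes only if it meets both count thresholds
passes <- function(n_match, n_mismatch, min_match, max_mismatch) {
  n_match >= min_match && n_mismatch <= max_mismatch
}

shortfall_80 <- min_match_80 - ad_match
pass_80 <- passes(ad_match, ad_mismatch, min_match_80, max_mismatch)
pass_75 <- passes(ad_match, ad_mismatch, min_match_75, max_mismatch)
```

Under this rounding assumption, 75% of 1005 loci is 754, so the A/D pair with 753 matching loci would still fall one locus short.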

Should we change the minimum match to 75% or would that allow in lower quality matches that are less favorable?

katcatalano commented 5 years ago

When I compared the genotypes of fish we knew were recaptured/regenotyped based on tag_id, the mean percent of mismatching genotypes was 2.64% and the standard deviation was 2.50%. Based on that (mean plus one standard deviation = 5.14%), I think we should change the parameters to:

Minimum matching loci = 94.86% of total number of loci
Fuzzy matching mismatch = 5.14% of total number of loci

Thoughts?

mstuart1 commented 5 years ago

Currently the parameters used are 80% minimum matching and 10% fuzzy matching mismatches. Changing the percentages to 95% minimum match and 5% total mismatch would result in more fish with the same tag_ids that do not appear as genetic matches.

katcatalano commented 5 years ago

Agreed. I think the input parameters I suggested here would only be an improvement if we used them in an analysis with only loci where every fish is genotyped.


katcatalano commented 5 years ago

@mstuart1 we talked about trying an identity analysis in Colony yesterday. I started that today on the DEENR node and I'll let you know what I get back when it's done.

katcatalano commented 5 years ago

Moving this to issue #13

mpinsky commented 5 years ago

This is an interesting and very worthwhile investigation. @katcatalano, what was the range of %matching loci and %mismatching loci for tag_id recaptures? It's interesting to know the range, not just the mean +/- SD.

mpinsky commented 5 years ago

I think there are two things to consider with genotyping error rates:

%mismatch
number of matching loci

It's super helpful to see that the former varies from 0-7% or so.

Do you have the # matching loci for the recaptures as well?

On Wed, May 8, 2019 at 9:07 AM katcat notifications@github.com wrote:

The range is 0.6-6.30% loci mismatching for tag_id recaptures. This is only for the 4 fish where there were tag_id recaptures that were resequenced. @mstuart1 does 4 fish sound like the right number to you?

The four fish are:

gen_id | percent_mismatch
------ | ----------------
1258 | 1.82%
1336 | 1.82%
1343 | 0.60%
1298 | 6.30%

I'm happy to discuss genotyping error rate more on this issue, but if we can, let's keep discussion about reassigning gen_ids to #13 (https://github.com/pinskylab/genomics/issues/13) so @agdedrick can follow more easily.
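As a quick check, the summary statistics quoted earlier in the thread (mean 2.64%, SD 2.50%) can be roughly reproduced from the four values in the table. This is a minimal sketch; the variable names are illustrative.

```r
# Percent mismatch for the four tag_id recaptures listed above, named by gen_id
percent_mismatch <- c("1258" = 1.82, "1336" = 1.82, "1343" = 0.60, "1298" = 6.30)

mean_mismatch <- mean(percent_mismatch)  # about 2.6
sd_mismatch <- sd(percent_mismatch)      # about 2.5 (sample standard deviation)

# Mean plus one SD is close to the 5.14% mismatch cutoff proposed earlier,
# and 100 minus that is close to the proposed 94.86% minimum match.
proposed_mismatch_cutoff <- mean_mismatch + sd_mismatch
proposed_min_match <- 100 - proposed_mismatch_cutoff

mismatch_range <- range(percent_mismatch)
```

With only four fish, the SD is driven largely by the single 6.30% value, so the cutoff derived this way is sensitive to that one pair.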


katcatalano commented 5 years ago

I saved the data frame with all relevant information as a text file and csv here (https://github.com/katcatalano/parentage/blob/master/text_file/genotyping_error.csv) and as R data here (https://github.com/katcatalano/parentage/blob/master/r_data/genotyping_error.Rds).

When I limited the calculations to fish that had been marked recap=Y (as in recaptured/regenotyped fish) I only found 4 fish. But when I instead filtered for fish with multiple occurrences of their tag_id and different ligation_ids I found 11 fish. Should these fish all have a recap=Y status, @mstuart1? For fish with the same tag_id but different gen_ids, we think this is coming from the issue where Cervus may not be handling missing genotypes correctly, so fish that are identical aren't being identified in the Cervus identity analysis. There are also 3 fish without gen_ids.

As context, the code I used to generate this table is below. Please let me know if anything jumps out as problematic.

library(dplyr)

#%!in% is not defined by base R or dplyr, so define it as the negation of %in%
`%!in%` <- Negate(`%in%`)

#join seq17 and seq33 genetic data
common_loci_names <- intersect(names(dat_gen17_meta), names(dat_gen33_meta))
dat_gen17_common_loci <- dat_gen17_meta[, common_loci_names]
dat_gen33_common_loci <- dat_gen33_meta[, common_loci_names]

#pull together all fish_ids
fish_ids <- bind_rows(fish_ids17, fish_ids33)
#pull together the sequencing data with only the common loci
all_seq <- bind_rows(dat_gen17_common_loci, dat_gen33_common_loci) %>%
    select(-contains("00")) #meant to get rid of loci with NA; note that contains() matches column names, not genotype values

##get ids for fish that are regenotyped
multiple_lig_fish <- bind_rows(fish_meta33, fish_meta17) %>% #get all the ligation_ids associated with a gen_id
    distinct(ligation_id, .keep_all = TRUE) %>% #Michelle kept the best ligation, so there will be duplicates
    filter(ligation_id %!in% issues$ligation_id) %>% #check for no issue fish again. Nope, good.
    filter(!is.na(tag_id)) %>% #get only fish that have tag_ids
    group_by(tag_id) %>%
    filter(n()>1) %>% #get only fish that were sampled twice
    arrange(tag_id) %>% #visually inspect
    ungroup() %>%
    mutate(total_loci=ncol(all_seq)-1) %>%
    mutate(num_mismatch_loci=NA_integer_) %>% #typed NA placeholders so the columns stay numeric when filled in by the loop
    mutate(num_matching_loci=NA_integer_) %>%
    mutate(percent_mismatch=NA_real_) %>%
    mutate(check_984=NA_integer_) #make a column that adds matching loci and not matching loci as a quick check that loop is working as expected, should be 984

##from Michelle 03/26/2019: Fish with the same sample_id and gen_id are lab regenotypes, fish with the same gen_id but different sample_ids are fish that were captured twice and regenotyped

# use a for loop to move through tag_ids and
#1) pull out both of the ligation_ids associated with that tag_id
#2) make a temporary data frame for each ligation of a fish with the same tag_id with ncol=number of SNPs in common between both sequencing events, where any locus that wasn't genotyped for both ligations is removed
#3) sum the loci that don't match between ligations of the same tag_id and divide that value by ncol(), put that value in the new "genotyping_error" df in a column "percent_mismatch" for each tag_id and sum the number of loci matching and put that in column "num_matching_loci" in "genotyping_error" df

tag_ids <- multiple_lig_fish %>%
    #filter(recap=="Y") %>% #here is where I can either look at only fish marked as recaptures by their recap status, or look at fish with multiple occurrences of the tag_id
    select(tag_id) %>%
    distinct(tag_id)

#create empty data frame (same columns and types as multiple_lig_fish) to add individuals' genotype data to
genotyping_error <- multiple_lig_fish[0, ]

for(i in seq_len(nrow(tag_ids))){

    tag_id_eval <- tag_ids$tag_id[i]

    regeno_indv_eval <- multiple_lig_fish %>%
        filter(tag_id == tag_id_eval)

    ligations_eval <- regeno_indv_eval$ligation_id #only the first two ligations of a tag_id are compared below

    seq_eval1 <- all_seq %>%
        filter(ligation_id == ligations_eval[1])

    seq_eval2 <- all_seq %>%
        filter(ligation_id == ligations_eval[2])

    stopifnot(identical(names(seq_eval1), names(seq_eval2))) #check that loci compared are the same

    #as in the original indexing, column 1 is taken to be ligation_id and the remaining columns are loci
    loci1 <- seq_eval1[, 2:ncol(seq_eval1)]
    loci2 <- seq_eval2[, 2:ncol(seq_eval2)]
    n_mismatch <- sum(loci1 != loci2, na.rm = TRUE) #na.rm so a missing genotype does not turn the whole sum into NA
    n_match <- sum(loci1 == loci2, na.rm = TRUE)

    regeno_indv_eval2 <- regeno_indv_eval %>%
        mutate(num_mismatch_loci = n_mismatch) %>%
        mutate(num_matching_loci = n_match) %>%
        mutate(percent_mismatch = n_mismatch/(ncol(all_seq)-1)*100) %>%
        mutate(check_984 = num_mismatch_loci + num_matching_loci)
    genotyping_error <- bind_rows(genotyping_error, regeno_indv_eval2)
}
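Step 2 in the comments above describes dropping any locus that was not genotyped in both ligations. One way that step could look is sketched below. The helper `compare_genotypes()` and the missing-genotype code `"0000"` are assumptions for illustration and do not come from the code above.

```r
# Compare two genotype vectors locus by locus, ignoring any locus that is
# missing (NA or `missing_code`) in either sample.
# `missing_code = "0000"` is an assumed placeholder; adjust it to the data.
compare_genotypes <- function(g1, g2, missing_code = "0000") {
  stopifnot(length(g1) == length(g2))
  typed_in_both <- !is.na(g1) & !is.na(g2) &
    g1 != missing_code & g2 != missing_code
  n_compared <- sum(typed_in_both)
  n_match <- sum(g1[typed_in_both] == g2[typed_in_both])
  n_mismatch <- n_compared - n_match
  data.frame(
    n_compared = n_compared,
    n_match = n_match,
    n_mismatch = n_mismatch,
    percent_mismatch = if (n_compared > 0) 100 * n_mismatch / n_compared else NA_real_
  )
}

# Toy example: six loci, three of which are typed in both ligations
lig_a <- c("0101", "0102", "0202", "0000", "0102", NA)
lig_b <- c("0101", "0102", "0102", "0102", "0000", "0101")
result <- compare_genotypes(lig_a, lig_b)
```

Dividing by the number of loci typed in both samples, rather than by the total number of loci, keeps missing data from deflating the mismatch percentage.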

mpinsky commented 5 years ago

Super helpful. Looks like >= 900 loci matching, at least for these.


mstuart1 commented 5 years ago

I found 7 fish (14 observations) that have the same tag and have been genotyped. One of the fish's samples was placed in the known issues list, so 6 pairs of fish were compared for mismatch proportion. They can be seen in this table (https://github.com/pinskylab/genomics/blob/master/data/genotyped-tag-recaps.csv). One of these pairs has a large mismatch proportion (25%). I looked at the pitscan history and the observations of this fish seem accurate, which leads me to doubt the integrity of the lab sample.

mpinsky commented 5 years ago

Wow, that mismatching proportion is super high. Is there a lab reason we shouldn't trust one of those genotypes? Or do we chalk this up as an outlier that we'd never be able to match without the PIT tag?


mstuart1 commented 5 years ago

I'm going to look into the work history of the samples to see if there are any notes about problems on the plates.

katcatalano commented 5 years ago

I'm curious why that fish with high mismatch didn't appear in my table of fish with repeat occurrences of tag_ids, and why there are 7 vs 11 fish. @mpinsky, @mstuart1 and I just talked about this and @mstuart1 is going to check out what she gets running my code to see what the differences are. @mstuart1 here is the seq17 data I used in my code and here is the seq33 data I used. Once you load those two data frames you should be able to run the code I pasted above. I filtered out fish with ligation_ids in the "issues" table, so I'm now wondering how they snuck in there....

Thanks for checking this out, Michelle!

mstuart1 commented 5 years ago

It looks like one of the fish in the high mismatch pair is not in @katcatalano 's seq17 nor seq33 data so it didn't match the filter looking for more than one tag instance. I had 7 fish instead of 11 because I was only looking at recaptures (different tissue samples) and you included regenotypes (same tissue samples).

The method depicted above by @katcatalano seems like a great way to determine genotype error because it looks at genotyping of the same tissue sample. In order to tease out the best parameters to use for Cervus identity analysis, which looks for hard cutoff numbers instead of percentages, I looked at all sample comparisons.

This is a histogram of the distribution of mismatch proportions among all samples. It makes sense because most fish are not recaptures. You can see some small specks on the lower left where identity matches are.

This is a zoomed in histogram showing only fish that have a mismatch proportion below 30%. It suggests a bimodal distribution.

Looking at the metadata of fish that have a 25% mismatch proportion, they are the same fish. I picked one that couldn't be a true recapture because the fish was 6.9 cm and 10 days later was 9.4 cm. Looking at photos, it is clearly the same fish.

Photos: MICR0008 (9), MICR0030 (6)
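For readers who want to reproduce this kind of all-versus-all comparison, here is a minimal sketch on simulated genotypes. The data, the sample names, and the 5% error rate are invented for illustration; none of it is the real dataset.

```r
set.seed(1)
n_loci <- 200

# Simulated genotypes: 5 fish by 200 loci, three possible genotypes per locus
genos <- matrix(
  sample(c("0101", "0102", "0202"), 5 * n_loci, replace = TRUE),
  nrow = 5,
  dimnames = list(paste0("fish", 1:5), NULL)
)

# Make fish5 a copy of fish1 with exactly 10 altered loci (a 5% mismatch),
# standing in for a recapture with genotyping error
genos["fish5", ] <- genos["fish1", ]
flip <- sample(n_loci, 10)
genos["fish5", flip] <- ifelse(genos["fish1", flip] == "0101", "0202", "0101")

# Every unordered pair of samples, one row per pair
sample_pairs <- t(combn(rownames(genos), 2))

# Proportion of loci that differ within each pair
mismatch_prop <- apply(sample_pairs, 1, function(p) {
  mean(genos[p[1], ] != genos[p[2], ])
})

all_comparisons <- data.frame(
  sample1 = sample_pairs[, 1],
  sample2 = sample_pairs[, 2],
  mismatch_prop = mismatch_prop
)

# hist(all_comparisons$mismatch_prop, breaks = 20) would draw the distribution
```

In this toy data the simulated recapture sits far to the left of the unrelated pairs, which is the kind of separation the zoomed-in histogram is meant to reveal.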

Cervus returns identity matches for fish with less than or equal to 10% mismatch rate. I am going to incrementally increase the mismatch rate and examine all of the new pairs until "noise" starts to creep in.

My assumption is that we would rather miss true recaptures than include false recaptures.

katcatalano commented 5 years ago

@agdedrick see above

mpinsky commented 5 years ago

I really like the two histograms. Super helpful, and I agree on interpretation of bimodality.

Not sure I followed your logic on the fish in the photos. You think they are the same fish, but with the wrong length measurements? Or do you think they are different fish because the mismatch proportion is really high?


mstuart1 commented 5 years ago

I think they are the same fish because their stripe shapes, tail color pattern, and body shading are all the same. I think this fish might have had a burst of growth but will double check by measuring in ImageJ. My logic is that if a fish with a mismatch of 24% is still the same fish based on photo evidence and was captured in the same location, 24% is most likely not "too much" mismatch.

Update: measuring in ImageJ was difficult because the fish is closer to the camera than the data sheet, so the perspective is distorted. I measured Cecil's thumb joint in photo A, measured the fish in units of thumb joints, then did the same for photo B. The fish appears to have grown; there is an ~115% difference, but this is an extremely coarse measurement.

mstuart1 commented 5 years ago

Here is a report of results when attempting to change Cervus parameters.

Given a dataset of 75 fish with 7 known identity matches due to tag recapture, lowering the number of required matching loci incrementally to 10% results in 0 false positives.

???

katcatalano commented 5 years ago

Thanks for this awesome analysis, Michelle! As we are discussing in the office right now, I think maybe we were making the identity analysis in Cervus too conservative (missing true matches) by using the high cutoff for number of matching loci. It seems from this paper (https://www.nature.com/articles/hdy200973) that maybe Cervus still has high power with few loci (using SNPs).

Selected excerpts: "The aim of this study was to assess the usefulness and limitations of microsatellite markers and SNPs for paternity and identity analysis in a species with extremely low genetic variability (using Cervus).... We selected subsets of 480, 240, 120, 60, 30 and 15 loci in two ways: (i) selecting the most polymorphic loci, and (ii) taking a random subset of loci. We repeated the simulations using these subsets....there were three known mother–father–offspring trios in this data set....Identity analysis with 960 SNP loci revealed no genotypes among 50 animals that matched exactly. When the analysis was repeated with either the most heterozygous subset of loci or with randomly selected loci, no matching genotypes were found even when as few as 15 loci were used."

So maybe a lower cutoff is the way to go, especially since you turned up no false positives and found all the true positives?

mpinsky commented 5 years ago

I’m confused by this. Our fish should match at 10% of loci just by chance (about 100 loci, right?). With such a low match threshold, I would expect Cervus to return everything as a match.

Or maybe I’m not understanding the analysis or the Cervus algorithm.
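A rough sketch of the chance-matching expectation is below. It assumes biallelic SNPs, Hardy-Weinberg proportions, unrelated individuals, and no missing data; none of these is stated in the thread, and the allele frequencies are made up.

```r
# Probability that two unrelated individuals carry the same genotype at a
# biallelic locus with allele frequency p, assuming Hardy-Weinberg proportions:
# both AA (p^4), both heterozygous ((2pq)^2), or both BB (q^4).
p_same_genotype <- function(p) {
  q <- 1 - p
  p^4 + (2 * p * q)^2 + q^4
}

# The probability is lowest when both alleles are equally common
lowest_chance <- p_same_genotype(0.5)  # 0.375

# Hypothetical allele frequencies for 1005 loci, for illustration only
allele_freqs <- seq(0.05, 0.5, length.out = 1005)
expected_chance_matches <- sum(p_same_genotype(allele_freqs))
```

Under these assumptions, two unrelated fish typed at all 1005 loci would be expected to match at no fewer than 0.375 * 1005, or about 377 loci, so a minimum of roughly 100 matching loci would not exclude unrelated pairs on its own.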


katcatalano commented 5 years ago

I'll look at the cervus algorithm more after tomorrow, but I think it's important to note that cervus is given population level allele frequencies as prior information. In the test case Michelle is describing, those allele frequencies were for the whole ~2,400 fish samples. So if the given set of 10 loci were all highly polymorphic, those 10 loci could be enough to inform an identity match. For sibling or parentage matches the power is much lower, but here we are only concerned with matching identical fish. It's encouraging that Michelle's test was able to decipher false matches and true matches.


mstuart1 commented 5 years ago

Running all samples through the identity analysis with 10% matching loci and 10% mismatching (102 and 101 respectively) resulted in an increase of identity matches from 299 at 80% matching to 421 at 10% matching. Only about 120 more matches in a population of 2790 samples doesn't seem like too many. I am going to examine them now to see if they raise any red flags (different sites, shrink in size, were caught on the same day at different sites, different tag ids on the same day) etc.

Update: 4 of the pairs of fish were from different sites. There were photos for 3 of those pairs and they did not look like the same fish. Trying again at Cervus's recommended 50% matching loci to see how the number of recaptures changes.
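The red-flag screening described above could be sketched as follows. The data frame, its column names, and its values are hypothetical and are not taken from the lab database.

```r
# Hypothetical identity-match pairs with capture metadata for each sample.
# Capture 1 is assumed to be on or before capture 2.
candidate_pairs <- data.frame(
  pair_id = 1:4,
  site_1 = c("A", "A", "B", "C"),
  site_2 = c("A", "B", "B", "C"),
  date_1 = as.Date(c("2015-05-01", "2015-05-01", "2016-06-10", "2016-06-10")),
  date_2 = as.Date(c("2016-05-03", "2016-05-03", "2016-06-10", "2017-06-01")),
  size_1 = c(6.9, 7.0, 8.1, 9.0),
  size_2 = c(9.4, 7.5, 8.1, 7.2),
  tag_1 = c("t1", NA, "t3", "t4"),
  tag_2 = c("t1", NA, "t5", "t4"),
  stringsAsFactors = FALSE
)

# Red flag: the two samples come from different sites
candidate_pairs$diff_site <- candidate_pairs$site_1 != candidate_pairs$site_2

# Red flag: the fish is smaller at the later capture
candidate_pairs$shrank <- candidate_pairs$size_2 < candidate_pairs$size_1

# Red flag: two different tag ids recorded on the same day
candidate_pairs$same_day_diff_tag <-
  candidate_pairs$date_1 == candidate_pairs$date_2 &
  !is.na(candidate_pairs$tag_1) & !is.na(candidate_pairs$tag_2) &
  candidate_pairs$tag_1 != candidate_pairs$tag_2

candidate_pairs$any_flag <- candidate_pairs$diff_site |
  candidate_pairs$shrank |
  candidate_pairs$same_day_diff_tag

flagged_pairs <- candidate_pairs$pair_id[candidate_pairs$any_flag]
```

Flagged pairs are candidates for manual review (photos, plate positions) rather than automatic rejection, since measurement error alone can trigger a flag.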

mstuart1 commented 5 years ago

After reading the Cervus Identity Manual, it looks like Cervus uses the hard numbers of required matching loci, with their default being half the number of loci used in the allele frequency analysis. Our use of 80% of the number of loci in the allele frequency analysis looks conservative compared to that. Fuzzy matching allows inexact matches. I believe the reason we are not getting a huge increase in false recaptures when we decrease the number of required matching loci to 10% is because we are still only allowing a 10% mismatch.

Cervus only calculates the probability of recapture (pid) or probability of sibling (pidsib) if the match is exact. None of our matches are exact.
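The interaction between the two count thresholds can be illustrated with a simplified rule. This is a sketch of the behavior as described in this thread, not Cervus's actual implementation, and the four candidate pairs are invented.

```r
# Simplified reading of the identity rule: a pair is reported when it has at
# least `min_match` matching loci and at most `max_mismatch` mismatching loci.
is_reported_match <- function(n_match, n_mismatch, min_match, max_mismatch) {
  n_match >= min_match & n_mismatch <= max_mismatch
}

# Invented pairs; the counts cover only loci typed in both samples
candidates <- data.frame(
  pair = c("recapture, well typed", "recapture, poorly typed",
           "unrelated, well typed", "unrelated, poorly typed"),
  n_match = c(950, 400, 420, 110),
  n_mismatch = c(30, 25, 560, 95)
)

candidates$mismatch_fraction <-
  candidates$n_mismatch / (candidates$n_match + candidates$n_mismatch)

# 80% and 10% of 1005 loci (804 and 102 minimum matches), both with the
# 101-locus mismatch cap quoted in this thread
candidates$match_at_80 <- is_reported_match(
  candidates$n_match, candidates$n_mismatch, min_match = 804, max_mismatch = 101
)
candidates$match_at_10 <- is_reported_match(
  candidates$n_match, candidates$n_mismatch, min_match = 102, max_mismatch = 101
)
```

In this toy data the mismatch cap still rejects the well-typed unrelated pair, but the poorly typed unrelated pair passes at the 10% setting despite a mismatch fraction near 46%, because both thresholds are absolute counts rather than proportions of the loci actually compared.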

mstuart1 commented 5 years ago

Re-ran the cervus identity analysis with 50% matching loci and only 1 pair of fish was dropped as a recapture, from 421 resulting paired fish to 420 paired fish. The same 4 recaptures that were flagged as being from different sites and looking like different fish in the photos (for one of the pairs) still show up as recaptures. It would be helpful to talk through these results as a group.

katcatalano commented 5 years ago

Maybe we should all sit down on Monday when @agdedrick is here?


mstuart1 commented 5 years ago

3 of the pairs were on the same extraction and digest plates, and out of the 6 samples involved, 3 were in the same row on the plate, so there is a high chance of cross contamination. These samples have been added to the known_issues table and the genepop file is being recreated without them.

mpinsky commented 5 years ago

CERVUS uses #loci matching and #loci mismatching, right? Rather than %matching and %mismatching?

We should calculate #mismatching/(#mismatching + #matching) for each apparent identity match. I bet the fraction is quite high for some (>25%).


mstuart1 commented 5 years ago

The calculation of #mismatching/(#mismatching + # matching) is standard procedure for the identity analysis. I plot it every time to make sure nothing looks out of place.

mpinsky commented 5 years ago

From comparison of known regenotypes and tag recaptures, %mismatching is always <5%, right, except for one tag recapture at 25%? That suggests our genotyping error is about 5%, which would be a good cutoff for identity matches.


mstuart1 commented 5 years ago

When looking at all comparisons (not Cervus identity analysis results), there was one case where 2 samples that appeared to be a recapture based on capture location, number of loci matching, and photographs had a 25% mismatch proportion, and another case where 2 samples that did not appear to be a recapture based on photographs had a 5% mismatch proportion.

Katrina calculated our genotype error rate to be 6%.

mpinsky commented 5 years ago

The 25% seems too high to believe for a recapture unless our genotyping error rate is ridiculously high (in which case we have other problems).

mstuart1 commented 5 years ago

That was the tagged recapture that did not return a genetic recapture.

mstuart1 commented 5 years ago

I tried 7 different tests of Cervus; the results are here: https://pinskylab.github.io/genomics/scripts/determine-cervus-parameters.nb.html. @agdedrick the bottom two tests look at changes in the mismatching loci allowance. Let me know which set of parameters you'd like me to use and we can move forward.

katcatalano commented 5 years ago

Do we know anything about the tagged recapture that didn't show up as a regenotype? Is it possible that the sample was contaminated? Or was it missed because of the matching/mismatching parameter values in the Cervus run? Overall, are all fish with the same tag_id that were regenotyped (except for this one) picked up by Cervus?

mstuart1 commented 5 years ago

The tagged recapture didn't show up as a recapture in Cervus because fewer than 80% of its loci were present, so it was not available for comparison. It did show up in the tests at lower percentages. All tagged recaptures were identified as genetic recaptures by Cervus once we lowered the minimum-match cutoff to below the missing data threshold set by filtering.
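To make the interaction between missing data and the minimum-match cutoff concrete, here is a rough Python sketch. It is a simplified model, not Cervus's implementation: the function names are hypothetical, and rounding the percentages up to whole loci is an assumption chosen because it reproduces the 804 and 101 figures quoted earlier for 1005 loci. The 790-locus sample is a made-up example; the 753/83 counts are fish A vs. fish D from this thread.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def count_thresholds(total_loci: int, min_match_pct: int, max_mismatch_pct: int):
    """Convert percentage parameters into locus counts (rounding up: an assumption)."""
    min_match = ceil_div(total_loci * min_match_pct, 100)
    max_mismatch = ceil_div(total_loci * max_mismatch_pct, 100)
    return min_match, max_mismatch


def is_identity_match(n_matching, n_mismatching, total_loci,
                      min_match_pct, max_mismatch_pct) -> bool:
    """A pair passes only if it meets both count-based cutoffs."""
    min_match, max_mismatch = count_thresholds(
        total_loci, min_match_pct, max_mismatch_pct)
    return n_matching >= min_match and n_mismatching <= max_mismatch


def can_ever_match(n_typed: int, total_loci: int, min_match_pct: int) -> bool:
    """A sample cannot match on more loci than it has typed."""
    min_match, _ = count_thresholds(total_loci, min_match_pct, 0)
    return n_typed >= min_match


print(count_thresholds(1005, 80, 10))             # (804, 101)
# Hypothetical sample with 790 of 1005 loci typed (under 80%):
print(can_ever_match(790, 1005, 80))              # False
print(can_ever_match(790, 1005, 50))              # True
# Fish A vs. fish D: 753 matching, 83 mismatching.
print(is_identity_match(753, 83, 1005, 80, 10))   # False
print(is_identity_match(753, 83, 1005, 50, 10))   # True
```

Under this model, any sample typed at fewer loci than the minimum-match count is excluded before mismatches are even considered, which is consistent with the tagged recapture only appearing once the cutoff dropped below the filtering threshold.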

mstuart1 commented 5 years ago

Meeting to figure out answers to these questions: If we get a false positive, we eliminate a fish that should've been considered or we add a fish to the wrong parent/offspring pool.

Run cervus with 50% matching loci and 10% mismatching allowed.

Michelle to update the recaptured-fish.RData file to be called fish-obs.Rdata and to contain the columns fish_table_id, gen_id, tag_id, and fish_indiv. The fish_table_id is the event id from the clownfish table in the Leyte database. The gen_id represents the successful genotype and is filled in for the events where the fish was successfully genotyped. The tag_id is the PIT tag in the fish at each event time. The fish_indiv is the same for all observations of the same fish (tag and genotype events). This table is for fish that were marked in some way, either through a successful genotype or a PIT tag, regardless of whether they were ever recaptured.

Michelle will also remove gen_id from the Leyte database.

mstuart1 commented 5 years ago

The new table that contains a fish_indiv identifier for all fish that were ever genotyped or tagged has been created. Fish that were handled more than once have the same fish_indiv identifier.
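One way such a shared identifier could be assigned is sketched below in Python. This is not the script used to build the table: the function name, the union-find approach, the choice of the smallest fish_table_id as the fish_indiv value, and the example rows are all assumptions for illustration. Only the column names come from the meeting notes above.

```python
def assign_fish_indiv(observations):
    """Give every observation of the same fish the same fish_indiv.

    observations: list of dicts with keys fish_table_id, gen_id, tag_id
    (gen_id / tag_id are None when the event has no genotype / no tag).
    Observations are linked when they share a gen_id or a tag_id, and
    links are followed transitively using a union-find structure.
    Returns a dict mapping fish_table_id -> fish_indiv.
    """
    parent = {obs["fish_table_id"]: obs["fish_table_id"] for obs in observations}

    def find(x):
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so fish_indiv is deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (column, value) -> first fish_table_id carrying it
    for obs in observations:
        for column in ("gen_id", "tag_id"):
            value = obs[column]
            if value is None:
                continue
            key = (column, value)
            if key in first_seen:
                union(first_seen[key], obs["fish_table_id"])
            else:
                first_seen[key] = obs["fish_table_id"]

    return {obs["fish_table_id"]: find(obs["fish_table_id"]) for obs in observations}


# Hypothetical rows: events 1 and 2 share a genotype, events 2 and 3 share a tag.
fish_obs = [
    {"fish_table_id": 1, "gen_id": 100, "tag_id": None},
    {"fish_table_id": 2, "gen_id": 100, "tag_id": "T1"},
    {"fish_table_id": 3, "gen_id": None, "tag_id": "T1"},
    {"fish_table_id": 4, "gen_id": 200, "tag_id": None},
]
fish_indiv = assign_fish_indiv(fish_obs)
print(fish_indiv)  # {1: 1, 2: 1, 3: 1, 4: 4}
```

Following links transitively means a fish first identified by genotype and later only by tag still ends up in one group, as long as at least one event carries both identifiers.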

pinskylab / genomics

Identity Analysis parameters for Cervus #22

join seq17 and seq33 genetic data

pull together all fish_ids

pull together the sequencing data with only the common loci

get ids for fish that are regenotyped

filter(ligation_id %!in% issues$ligation_id) %>% #check for no issue fish again. Nope, good.

from Michelle 03/26/2019: Fish with the same sample_id and gen_id are lab regenotypes, fish with the same gen_id but different sample_ids are fish that were captured twice and regenotyped

use a for loop to move through tag_ids and 1) pull out both of the ligation_ids associated with that tag_id

2) make a temporary data frame for each ligation of a fish with the same tag_id with ncol = number of SNPs in common between both sequencing events, where any locus that wasn't genotyped for both ligations is removed

4) sum the loci that don't match between ligations of the same tag_id and divide that value by ncol(); put that value in the new "genotyping_error" df in a column "percent_mismatch" for each tag_id, then sum the number of loci matching and put that in column "num_loci_matching" in the "genotyping_error" df
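The comparison described in the steps above can be sketched roughly as follows. This is a Python illustration rather than the original R loop: the function name, the dict-based genotype encoding (two-letter strings, None for missing), and the SNP names are assumptions. The output keys mirror the "percent_mismatch" and "num_loci_matching" columns, with percent_mismatch kept as a fraction because the steps divide by the number of shared loci without multiplying by 100.

```python
def compare_ligations(genotypes_a: dict, genotypes_b: dict) -> dict:
    """Compare two ligations of the same tag_id locus by locus.

    genotypes_a / genotypes_b map locus name -> genotype string such as "AG",
    or None when the locus was not genotyped (hypothetical encoding).
    """
    def normalize(genotype: str) -> str:
        # Treat "AG" and "GA" as the same heterozygote.
        return "".join(sorted(genotype))

    # Keep only loci genotyped in both ligations.
    shared = [
        locus for locus in genotypes_a
        if genotypes_a[locus] is not None
        and genotypes_b.get(locus) is not None
    ]
    if not shared:
        raise ValueError("no loci were genotyped in both ligations")

    n_mismatch = sum(
        normalize(genotypes_a[locus]) != normalize(genotypes_b[locus])
        for locus in shared
    )
    return {
        "num_loci_compared": len(shared),
        "num_loci_matching": len(shared) - n_mismatch,
        "percent_mismatch": n_mismatch / len(shared),  # fraction, not x100
    }


# Hypothetical genotypes for two ligations of one tagged fish.
ligation_1 = {"snp1": "AA", "snp2": "AG", "snp3": None, "snp4": "GG", "snp5": "CT"}
ligation_2 = {"snp1": "AA", "snp2": "GG", "snp3": "TT", "snp4": "GG"}
result = compare_ligations(ligation_1, ligation_2)
print(result["num_loci_compared"], result["num_loci_matching"])  # 3 2
```

Here snp3 and snp5 are dropped because each is missing in one ligation, leaving three shared loci of which one (snp2) mismatches.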

filter(recap=="Y") %>% #here is where I can either look at only fish marked as recaptures by their recap status, or look at fish with multiple occurrences of the tag_id

create empty data frame to add individuals' genotype data to