niekverw / ukbpheno

MIT License

A question about Study population #2

Closed zhangpicb closed 2 years ago

zhangpicb commented 2 years ago

Hi @lea1010 @niekverw

Thanks for your beautiful code!

I ran into a problem while using the ukbpheno example data to learn the package. I want to generate a binary phenotype for HfInCad.

lst.HfInCad.case_control <- get_cases_controls(definitions=dfDefinitions_processed_expanded %>% filter(TRAIT=="HfInCad"), 
                                              lst.harmonized.data$lst.data,dfData.settings, 
                                              df_reference_date=df_reference_dt_v0)

df.HfInCad.casecontrol <- lst.HfInCad.case_control$df.casecontrol

I randomly selected an individual to check:

plot_individual_timeline(df.data.settings = dfData.settings,lst.data=lst.harmonized.data$lst.data,ind_identifier = 2003417)

This individual is a case in HfInCad, but he/she only has Include_in_cases codes and no HfInCad Study_population code.

Thanks in advance!

lea1010 commented 2 years ago

@zhangpicb

The individual timeline should show all diagnosis codes (by default). So if the individual is a case in HfInCad, you should see CAD diagnosis codes preceding HF diagnosis codes on his/her timeline. For the HfInCad phenotype (CAD in "Study_population" & HF in "Include_in_cases"), individuals who have HF are classified as cases among people with CAD, i.e. both cases and controls have CAD.
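As a rough illustration of that definition, here is a minimal base R sketch on invented data. The columns has_cad, has_hf and status are hypothetical and this is not how ukbpheno computes the phenotype internally; it only shows the idea that the study population restricts who can be a case or a control.

```r
# Toy data: the TRUE/FALSE flags are invented for illustration
df_toy <- data.frame(
  identifier = 1:5,
  has_cad = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  has_hf  = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Study population: only participants with CAD are considered at all
in_population <- df_toy$has_cad

# Within that population, HF defines cases and its absence defines controls;
# participants outside the population get NA in this sketch
df_toy$status <- ifelse(
  in_population,
  ifelse(df_toy$has_hf, "case", "control"),
  NA_character_
)

print(df_toy)
```

Note that participant 4 has HF but no CAD, so in this sketch he/she is neither a case nor a control.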

Does that answer your question?

zhangpicb commented 2 years ago

Hi @lea1010

Thanks for your quick reply!

I understand what you mean. This individual (ind_identifier = 2003417) has a CAD diagnosis after he/she was diagnosed with HF. He/she has ICD-10 code I509 in 2008, and ICD-10 codes I501 and I259 in 2018. That means HF diagnosis codes precede CAD diagnosis codes on his/her timeline. Something other than CAD may have caused the HF.

Thanks in advance!

lea1010 commented 2 years ago

Hi @zhangpicb Could you check the reference date for this participant and then check which diagnosis code corresponds to it? Is it possible that the first CAD diagnosis comes from self-report/GP?

zhangpicb commented 2 years ago

Hi @lea1010

I checked all diagnosis codes, including self-report, and HF diagnosis codes precede CAD diagnosis codes on his/her timeline. This person has few diagnosis codes.

Thanks in advance!

lea1010 commented 2 years ago

Hi @zhangpicb Could you first check the reference date for this participant in the case/control data table, i.e. ref_date <- lst.HfInCad.case_control$df.casecontrol[identifier==2003417, reference_date]

And then check the data records for this participant on this reference date, i.e. lapply(lst.harmonized.data$lst.data, function(x) {x[identifier==2003417 & eventdate==ref_date]})

Do you get some records back which correspond to a CAD diagnosis?

(P.S. Please do not paste the full results here)
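The two steps above can be tried on toy data first. This is a minimal base R sketch, assuming each harmonized table has identifier and eventdate columns as in the snippet above. The list lst_toy, the table name toy.hospital.icd10, the code column and all values are invented for illustration and are not real participant data.

```r
# Toy stand-ins for lst.harmonized.data$lst.data (names and values are made up)
lst_toy <- list(
  toy.hospital.icd10 = data.frame(
    identifier = c(1L, 1L, 2L),
    eventdate  = as.Date(c("2008-05-01", "2018-02-18", "2018-02-18")),
    code       = c("I509", "I501", "I259"),
    stringsAsFactors = FALSE
  ),
  tte.death.icd10.primary = data.frame(
    identifier = 1L,
    eventdate  = as.Date("2018-02-18"),
    code       = "I259",
    stringsAsFactors = FALSE
  )
)

# Hypothetical reference date, as it would be read from df.casecontrol
ref_date <- as.Date("2018-02-18")

# Keep only this participant's records dated exactly on the reference date
records_on_ref <- lapply(lst_toy, function(x) {
  x[x$identifier == 1L & x$eventdate == ref_date, , drop = FALSE]
})

# Drop sources without a matching record so the result is easier to scan
records_on_ref <- Filter(function(x) nrow(x) > 0, records_on_ref)

print(records_on_ref)
```

With the real data, the tables that remain show which source contributed the record that set the reference date.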

zhangpicb commented 2 years ago

Hi @lea1010

Thanks for your quick reply!

This person has a CAD diagnosis I259 in tte.death.icd10.primary in 2018, and an HF code I501 in tte.death.icd10.secondary in 2018.

But in the plot_individual_timeline result, we can see this person got the heart failure code I509 in 2008. Identifier 2003417 only has ICD-10 codes related to HF and CAD.

Because this person has few diagnosis codes, I checked them carefully.

I had already mentioned this in my previous reply.

Thanks in advance!

lea1010 commented 2 years ago

Hi @zhangpicb

Indeed, the time of the event is not considered in the phenotype inclusion/exclusion stage. Therefore participants who have HF before CAD will still show up in the case_control table, but this will be reflected as negative Hx_days & first_diagnosis_days. Consider the possible scenarios:

identifier reference_date count sum.epidur median.epidur max.epidur survival_days Death_primary Death_any Hx_days Fu_days Hx Fu Ref first_diagnosis_days Any
1111111 2018-02-18 2 0 0 0 NA 1 1 -3628 NA 2 1 2 -3628 2
2222222 2010-03-15 9 0 0 0 NA 1 1 -8504 201 2 2 1 -8504 2
3333333 2018-03-15 7 13 1 4 NA 1 1 -1 122 2 2 2 -1 2
4444444 2018-03-15 3 0 0 0 NA 1 1 0 86 2 2 2 0 2
5555555 2020-03-15 1 43 48 48 NA 1 1 NA 69 1 2 1 69 2

Participant 1111111 would be what you described.

Participant 2222222 received HF diagnoses both before & after the CAD diagnosis (Hx=2 & Fu=2) but not on the reference_date, i.e. the first CAD diagnosis (Ref=1).

Participant 3333333 received HF diagnoses both before & after the CAD diagnosis (Hx=2 & Fu=2), and one of the HF diagnoses was received together with the first CAD diagnosis on the reference date (Ref=2).

Participant 4444444 received the first HF diagnosis together with the CAD diagnosis (Hx=2 & Ref=2) and a second HF diagnosis afterwards (Fu=2).

Participant 5555555 received HF diagnoses only after the CAD diagnosis (Hx=1 & Fu=2 & Ref=1).
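These scenarios can also be told apart programmatically. A minimal base R sketch, using only the identifier, Hx, Fu, Ref and first_diagnosis_days values from the example table above. The flag columns hf_before_cad, hf_on_ref_date and hf_after_cad are hypothetical names and are not produced by ukbpheno.

```r
# Values copied from the example table above (remaining columns omitted)
df_toy <- data.frame(
  identifier = c(1111111, 2222222, 3333333, 4444444, 5555555),
  Hx  = c(2, 2, 2, 2, 1),
  Fu  = c(1, 2, 2, 2, 2),
  Ref = c(2, 1, 2, 2, 1),
  first_diagnosis_days = c(-3628, -8504, -1, 0, 69)
)

# A negative first_diagnosis_days means the first HF record predates the
# reference date (the first CAD event)
df_toy$hf_before_cad <- df_toy$first_diagnosis_days < 0

# Ref == 2: an HF record falls on the reference date itself
df_toy$hf_on_ref_date <- df_toy$Ref == 2

# Fu == 2: an HF record exists during follow-up, i.e. after the reference date
df_toy$hf_after_cad <- df_toy$Fu == 2

print(df_toy[, c("identifier", "hf_before_cad", "hf_on_ref_date", "hf_after_cad")])
```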

At the moment, we have opted to annotate the participants (and preserve the information) in the table so that there is more flexibility and to ensure consistent output of the function for different phenotypes (with/without "Study_population" & with/without reference_date input).

To analyze participants who have CAD diagnosis codes at least 90 days preceding any HF diagnosis codes, one can filter the table by lst.HfInCad.case_control$df.casecontrol[first_diagnosis_days>=90]
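The same filter can be sketched in base R on a toy table. One caveat: with a plain data.frame, rows whose first_diagnosis_days is NA come back as all-NA rows unless they are excluded explicitly, so the sketch below handles them. The values are invented, and the NA row assumes that participants without any HF record have a missing first_diagnosis_days.

```r
# Toy stand-in for df.casecontrol; values are invented for illustration.
# The NA mimics a participant without any HF record (an assumption).
df_toy <- data.frame(
  identifier = c(1, 2, 3, 4),
  first_diagnosis_days = c(-3628, 69, 120, NA)
)

# Minimum gap between the reference date (first CAD) and the first HF record
min_gap_days <- 90

# Exclude NA explicitly, then apply the threshold
keep <- !is.na(df_toy$first_diagnosis_days) &
  df_toy$first_diagnosis_days >= min_gap_days

df_hf_after_cad <- df_toy[keep, , drop = FALSE]
print(df_hf_after_cad)
```

Only participant 3 remains here, since his/her first HF record is at least 90 days after the reference date.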

I hope this clarifies the case_control table sufficiently, and apologies for the confusion.

zhangpicb commented 2 years ago

Hi @lea1010

Thank you very much! Two questions still confuse me.

1. Meaning of the Ref column.

If I use get_cases_controls without a study population as input, I get the df.casecontrol table, and the Ref column indicates whether the diagnosis was made close to the reference date within a window (default: 0 days).

If I use get_cases_controls with a study population as input, I get the df.casecontrol table, and the Ref column indicates whether the two diagnosis codes (one from the study population and one from the case definition, such as CAD & HF) were made close to the reference date within a window (default: 0 days). Is that right?

2. survival_days

For the cases/controls included in the case_control table, the Death_primary and Death_any columns contain no NA. Does this mean all participants in the case_control table have died, or is there another explanation?

Thanks in advance!

lea1010 commented 2 years ago

Hi @zhangpicb

  1. When you use get_cases_controls() with Study_population, e.g. HfInCad, you are essentially using the date of the first CAD event as the reference date and identifying case/control status for HF. So it is the same as the scenario without Study_population.
  2. The death columns follow the same coding as Hx/Fu/Ref:
    • If the participant died of the target disease (HF in the current example), Death_primary/Death_any = 2 (depending on whether HF is listed as the primary or a secondary cause)
    • If the participant had not died of the target disease (at the time of censoring), Death_primary/Death_any = 1
    • If the participant is an excluded case, Death_primary/Death_any = -2
    • If the participant is an excluded control, Death_primary/Death_any = -1
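For reading these columns, the coding above can be turned into labels. A minimal base R sketch: status_labels and decode_status() are hypothetical helpers rather than part of ukbpheno, and the label wording is my own shorthand for the four codes listed above.

```r
# Lookup from the numeric coding listed above to a readable label
status_labels <- c(
  "2"  = "target disease present",
  "1"  = "target disease absent",
  "-1" = "excluded control",
  "-2" = "excluded case"
)

# decode_status() is a hypothetical helper, not a ukbpheno function.
# Values outside the lookup (including NA) are returned as NA.
decode_status <- function(x) {
  unname(status_labels[as.character(x)])
}

# Toy Death_any column; values are invented for illustration
death_any_toy <- c(2, 1, 1, -1, -2, NA)
print(decode_status(death_any_toy))
```

So a Death_primary/Death_any value of 1 does not by itself say the participant has died. It only says death from the target disease was not recorded.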

The survival days

I hope this is clear.

zhangpicb commented 2 years ago

Hi @lea1010

Thank you very much for the clear reply!

So in the HfInCad df.casecontrol table, the Hx/Fu/Ref/Death_primary/Death_any columns follow the same coding (for HF), and the reference date is the CAD reference date.

BTW, have HfInCad GWAS summary statistics been published? If so, how can I download them?

Thanks again! ukbpheno is a great tool! Thank you very much for your careful replies!

lea1010 commented 2 years ago

Hi @zhangpicb

Thanks for using our package!

The ukbpheno package is made to facilitate phenotype generation, and the definitions could differ per analysis / study question. HfInCad is an example phenotype we came up with when considering possible use cases during the development of the package. We currently do not have a published GWAS on this phenotype in UKB, but I agree it would be intriguing to see the genetic factors of this phenotype.

zhangpicb commented 2 years ago

Thank you very much for your reply!