kausmees / GenoCAE

Convolutional autoencoder for genotype data
BSD 3-Clause "New" or "Revised" License
15 stars 10 forks source link

Projecting non-population metadata #34

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

I'm trying to project sample metadata onto a trained GenoCAE model,

I'm wondering if this is a formatting issue with my --superpops input file (e.g. i have >2 columns, and some rows have NAs for missing metadata), or if GenoCAE is hard-code to only accept superpopulation-type metadata.

I'm attaching here an example of my metadata. merged.metadata.csv

Command

python3 run_gcae.py project --datadir /shared/bms20/projects/MND_ALS/SNP_VCFs/merged/ --data merged.SNPS_filtered.plink --trainedmodeldir als_out --model_id M1 --train_opts_id ex3  --data_opts_id b_0_4 --superpops /home/bms20/projects/MND_ALS/SNP_VCFs/merged/merged.metadata.csv

Output

...
...
Projecting epochs: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
Already projected: []
In DG.get_train_set: number of -1.0 genotypes in train: 5689140
In DG.get_train_set: number of -9 genotypes in train: 0
In DG.get_train_set: number of 0 values in train mask: 0
2022-06-23 12:39:48.063862: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

______________________________ Building model ______________________________
Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'strides': 1}
Adding layer: BatchNormalization: {}
Adding layer: ResidualBlock2: {'filters': 8, 'kernel_size': 5}
--- conv1d  filters: 8 kernel_size: 5
--- batch normalization
--- conv1d  filters: 8 kernel_size: 5
--- batch normalization
Adding layer: MaxPooling1D: {'pool_size': 5, 'strides': 2, 'padding': 'same'}
Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu'}
Adding layer: BatchNormalization: {}
Adding layer: Flatten: {}
Adding layer: Dropout: {'rate': 0.01}
Adding layer: Dense: {'units': 75}
Adding layer: Dropout: {'rate': 0.01}
Adding layer: Dense: {'units': 75, 'activation': 'elu'}
Adding layer: Dense: {'units': 2, 'name': 'encoded'}
Adding layer: Dense: {'units': 75, 'activation': 'elu'}
Adding layer: Dropout: {'rate': 0.01}
Adding layer: Dense: {'units': 75, 'activation': 'elu'}
Adding layer: Dropout: {'rate': 0.01}
Adding layer: Dense: {'units': 621496}
Adding layer: Reshape: {'target_shape': (77687, 8), 'name': 'i_msvar'}
Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu'}
Adding layer: BatchNormalization: {}
Adding layer: Reshape: {'target_shape': (77687, 1, 8)}
Adding layer: UpSampling2D: {'size': (2, 1)}
Adding layer: Reshape: {'target_shape': (155374, 8)}
Adding layer: ResidualBlock2: {'filters': 8, 'kernel_size': 5}
--- conv1d  filters: 8 kernel_size: 5
--- batch normalization
--- conv1d  filters: 8 kernel_size: 5
--- batch normalization
Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu', 'name': 'nms'}
Adding layer: BatchNormalization: {}
Adding layer: Conv1D: {'filters': 1, 'kernel_size': 1, 'padding': 'same'}
Adding layer: Flatten: {'name': 'logits'}
########################### epoch 2 ###########################
Reading weights from /shared/bms20/projects/GenoCAE/als_out/ae.M1.ex3.b_0_4.merged.SNPS_filtered.plink/weights/2
**Traceback (most recent call last):
  File "/shared/bms20/projects/GenoCAE/run_gcae.py", line 1011, in <module>
    plot_coords_by_superpop(coords_by_pop,"{0}/dimred_e_{1}_by_superpop".format(results_directory, epoch), superpopulations_file, plot_legend = epoch == epochs[0])
  File "/shared/bms20/projects/GenoCAE/utils/visualization.py", line 203, in plot_coords_by_superpop
    max_num_pops = max([len(superpop_dict[spop]) for spop in superpops])
ValueError: max() arg is an empty sequence**
bschilder commented 2 years ago

That said, as a test I provided some dummy metadata that better matched the pop, superpop metadata in the examples, but this produced the same error: [merged.metadata.csv](https://github.com/kausmees/GenoCAE/files/8967403/merged. Screenshot 2022-06-23 at 14 04 23 metadata.csv)

bschilder commented 2 years ago

I'm also noticing some of the internal functions seem to reference the .fam file, which in my case does not contain any phenotype metadata. so perhaps this is where things are going wrong:

https://github.com/kausmees/GenoCAE/blob/8f1d929b61998af47feca6df6abc90ef6bf0673a/utils/data_handler.py#L628 Screenshot 2022-06-23 at 14 09 49

cnettel commented 2 years ago

I would imagine that it shouldn’t care about the actual content, although empty entries just might cause some unexpected behavior. As your second file didn’t make any difference, that doesn’t seem to be the root cause.

To me, it even seems like this error could really be due to the list of individuals somehow being empty. That would imply ind_pop_list_train being empty.

Should I interpret your comment that you get something successfully with the alternate option, when no --superpop is specified, works? That one still retrieves the coordinates in the same way.

One observation that doesn’t bear out straight away, but which is worth noting, is that we could be relying on each pop only being present in one superpop. However, your second test example of English,Europe would satisfy that (while e.g. Male,0.94 would not).

I’m inclined to think that something else is going wrong, possibly even something related to the relatively small sample size, e.g. do we have some bug if the specified batch size is larger than the total sample size. What batch size are you using?

Från: Brian M. Schilder @.> Skickat: den 23 juni 2022 14:48 Till: kausmees/GenoCAE @.> Kopia: Subscribed @.***> Ämne: [kausmees/GenoCAE] Projecting non-population metadata (Issue #34)

I'm trying to project sample metadata onto a trained GenoCAE model,

I'm wondering if this is a formatting issue with my --superpops input file (e.g. i have >2 columns, and some rows have NAs for missing metadata), or if GenoCAE is hard-code to only accept superpopulation-type metadata.

I'm attaching here an example of my metadata. merged.metadata.csv https://github.com/kausmees/GenoCAE/files/8967246/merged.metadata.csv

Command

python3 run_gcae.py project --datadir /shared/bms20/projects/MND_ALS/SNP_VCFs/merged/ --data merged.SNPS_filtered.plink --trainedmodeldir als_out --model_id M1 --train_opts_id ex3 --data_opts_id b_0_4 --superpops /home/bms20/projects/MND_ALS/SNP_VCFs/merged/merged.metadata.csv

Output

Projecting epochs: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20] Already projected: [] In DG.get_train_set: number of -1.0 genotypes in train: 5689140 In DG.get_train_set: number of -9 genotypes in train: 0 In DG.get_train_set: number of 0 values in train mask: 0 2022-06-23 12:39:48.063862: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

__ Building model __ Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'strides': 1} Adding layer: BatchNormalization: {} Adding layer: ResidualBlock2: {'filters': 8, 'kernel_size': 5} --- conv1d filters: 8 kernel_size: 5 --- batch normalization --- conv1d filters: 8 kernel_size: 5 --- batch normalization Adding layer: MaxPooling1D: {'pool_size': 5, 'strides': 2, 'padding': 'same'} Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu'} Adding layer: BatchNormalization: {} Adding layer: Flatten: {} Adding layer: Dropout: {'rate': 0.01} Adding layer: Dense: {'units': 75} Adding layer: Dropout: {'rate': 0.01} Adding layer: Dense: {'units': 75, 'activation': 'elu'} Adding layer: Dense: {'units': 2, 'name': 'encoded'} Adding layer: Dense: {'units': 75, 'activation': 'elu'} Adding layer: Dropout: {'rate': 0.01} Adding layer: Dense: {'units': 75, 'activation': 'elu'} Adding layer: Dropout: {'rate': 0.01} Adding layer: Dense: {'units': 621496} Adding layer: Reshape: {'target_shape': (77687, 8), 'name': 'i_msvar'} Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu'} Adding layer: BatchNormalization: {} Adding layer: Reshape: {'target_shape': (77687, 1, 8)} Adding layer: UpSampling2D: {'size': (2, 1)} Adding layer: Reshape: {'target_shape': (155374, 8)} Adding layer: ResidualBlock2: {'filters': 8, 'kernel_size': 5} --- conv1d filters: 8 kernel_size: 5 --- batch normalization --- conv1d filters: 8 kernel_size: 5 --- batch normalization Adding layer: Conv1D: {'filters': 8, 'kernel_size': 5, 'padding': 'same', 'activation': 'elu', 'name': 'nms'} Adding layer: BatchNormalization: {} Adding layer: Conv1D: {'filters': 1, 'kernel_size': 1, 'padding': 'same'} Adding layer: Flatten: {'name': 'logits'} ########################### epoch 2 ########################### Reading weights from /shared/bms20/projects/GenoCAE/als_out/ae.M1.ex3.b_0_4.merged.SNPS_filtered.plink/weights/2 Traceback (most recent call last): File "/shared/bms20/projects/GenoCAE/run_gcae.py", line 1011, in plot_coords_by_superpop(coords_by_pop,"{0}/dimrede{1}_by_superpop".format(results_directory, epoch), superpopulations_file, plot_legend = epoch == epochs[0]) File "/shared/bms20/projects/GenoCAE/utils/visualization.py", line 203, in plot_coords_by_superpop max_num_pops = max([len(superpop_dict[spop]) for spop in superpops]) ValueError: max() arg is an empty sequence

— Reply to this email directly, view it on GitHub https://github.com/kausmees/GenoCAE/issues/34 , or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBJKQJQPYJRXNWEPAFN7WDVQRMJXANCNFSM5ZUHYDIA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

bschilder commented 2 years ago

I would imagine that it shouldn’t care about the actual content, although empty entries just might cause some unexpected behavior. As your second file didn’t make any difference, that doesn’t seem to be the root cause.

Agreed, I also tried replacing all NAs with the string "NA" and this made no difference.

To me, it even seems like this error could really be due to the list of individuals somehow being empty. That would imply ind_pop_list_train being empty. Should I interpret your comment that you get something successfully with the alternate option, when no --superpop is specified, works?

Removing the --superpop arg did the trick!

Screenshot 2022-06-23 at 14 33 59

Now i think i just have to edit my .fam file to include the metadata. That said, it would be much easier if GenoCAE could write the projected coordinates used in the plots to csv files. That way, I could replot the data with whatever viz software I want (e.g. plotly). Is there a way to save the projected coordinates? Just realized that I can probably get all this info from the encoded_data.h5 file. Should be easy enough to plot the data from that.

That one still retrieves the coordinates in the same way. One observation that doesn’t bear out straight away, but which is worth noting, is that we could be relying on each pop only being present in one superpop. However, your second test example of English,Europe would satisfy that (while e.g. Male,0.94 would not). I’m inclined to think that something else is going wrong, possibly even something related to the relatively small sample size, e.g. do we have some bug if the specified batch size is larger than the total sample size. What batch size are you using?

I'm using the model params included in ex3.json, which has a batch size of 10. So my command to train the model looks like:

python3 run_gcae.py train --datadir /shared/bms20/projects/MND_ALS/SNP_VCFs/merged/ --data merged.SNPS_filtered.plink --trainedmodeldir als_out --model_id M1  --epochs 20 --save_interval 2  --train_opts_id ex3  --data_opts_id b_0_4
cnettel commented 2 years ago

The coordinates should be saved in the HDF5 files, if that’s all you need.

Från: Brian M. Schilder @.> Skickat: den 23 juni 2022 15:34 Till: kausmees/GenoCAE @.> Kopia: Carl Nettelblad @.>; Comment @.> Ämne: Re: [kausmees/GenoCAE] Projecting non-population metadata (Issue #34)

I would imagine that it shouldn’t care about the actual content, although empty entries just might cause some unexpected behavior. As your second file didn’t make any difference, that doesn’t seem to be the root cause.

Agreed, I also tried replacing all NAs with the string "NA" and this made no difference.

To me, it even seems like this error could really be due to the list of individuals somehow being empty. That would imply ind_pop_list_train being empty. Should I interpret your comment that you get something successfully with the alternate option, when no --superpop is specified, works?

Removing the --superpop arg did the trick!

dimred_e_2_by_pop.pdf https://github.com/kausmees/GenoCAE/files/8967618/dimred_e_2_by_pop.pdf

Now i think i just have to edit my .fam file to include the metadata. That said, it would be much easier if GenoCAE could write the projected coordinates used in the plots to csv files. That way, I could replot the data with whatever viz software I want (e.g. plotly). Is there a way to save the projected coordinates?

That one still retrieves the coordinates in the same way. One observation that doesn’t bear out straight away, but which is worth noting, is that we could be relying on each pop only being present in one superpop. However, your second test example of English,Europe would satisfy that (while e.g. Male,0.94 would not). I’m inclined to think that something else is going wrong, possibly even something related to the relatively small sample size, e.g. do we have some bug if the specified batch size is larger than the total sample size. What batch size are you using?

I'm using the model params included in https://github.com/kausmees/GenoCAE/blob/master/train_opts/ex3.json ex3.json, which has a batch size of 10. So my command to train the model looks like:

python3 run_gcae.py train --datadir /shared/bms20/projects/MND_ALS/SNP_VCFs/merged/ --data merged.SNPS_filtered.plink --trainedmodeldir als_out --model_id M1 --epochs 20 --save_interval 2 --train_opts_id ex3 --data_opts_id b_0_4

— Reply to this email directly, view it on GitHub https://github.com/kausmees/GenoCAE/issues/34#issuecomment-1164415009 , or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBJKQI3N6HQW63PE3ACXWLVQRRS7ANCNFSM5ZUHYDIA . You are receiving this because you commented.Message ID: @.***>

bschilder commented 2 years ago

Success!


epoch <- 20
h5 <- rhdf5::h5read("/home/bms20/projects/GenoCAE/als_out/ae.M1.ex3.b_0_4.merged.SNPS_filtered.plink/merged.SNPS_filtered.plink/encoded_data.h5", 
                    name=paste(epoch,"encoded_train",sep="_"))

dat <- data.frame(t(h5)) 
colnames(dat) <- paste0("dim",seq_len(ncol(dat)))

meta <- data.table::fread("/home/bms20/projects/MND_ALS/SNP_VCFs/merged/merged.metadata.csv")
colnames(meta) <- c("ID","Sex","Surv_years","Status")
dat <- cbind(meta, dat)

library("ggplot2")

gg <- ggplot(dat, aes(x=dim1, y=dim2, shape=Status, color=Surv_years)) +
  geom_point() +
  scale_color_viridis_c() + 
  theme_bw()

plotly::ggplotly(gg)

Screenshot 2022-06-23 at 15 33 38