hehouts / dynamic-duos


IBDMDB virome filenames not join-able #4

Closed hehouts closed 2 years ago

hehouts commented 2 years ago

Part of issue #2

Virome files look like SM-76C9Y.tar, SM-9SIJC.tar, SM-7M8RR.tar, and I can't join them to the IBDMDB metadata table on project id number, external_id, site_sub_coll id, or anything else obvious.

If I search for the suffix of a virome file name, e.g. 76C9Y from SM-76C9Y.tar, in the ibdmdb_mvx_only metadata (the ibdmdb metadata, filtered for viromes), it does return a match.

Which column is it finding the match in??

hehouts commented 2 years ago

This is described in chunk s1 in the metadata summary.

Here I am frantically looking for the column that matches 76C9Y. Ctrl-f on 76C9Y landed on the row with external_id HSM6XRR7, so I filtered to that row and reduced the columns to character columns only.

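library(dplyr)
library(purrr)

# Keep only the row that ctrl-f landed on, character columns only
df <- ibdmdb_mvx_only %>%
  filter(external_id == "HSM6XRR7") %>%
  select(where(is.character))

# Drop the columns that are entirely NA for this row
df2 <- df[!map_lgl(df, ~ all(is.na(.)))]
df2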

df2 only has 131 columns, so I just clicked through them.

found tube_a_viromics!!!!

which matches the exact virome filename: SM-76C9Y
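
The click-through above could also be done in code. This is only a sketch in base R: `find_columns_with` is a hypothetical helper (not part of this repo), and `meta` is a toy stand-in for ibdmdb_mvx_only in which only SM-76C9Y and HSM6XRR7 come from this thread; every other value is made up.

```r
# Toy stand-in for ibdmdb_mvx_only (assumption: only SM-76C9Y and
# HSM6XRR7 are real values from this issue, the rest are invented).
meta <- data.frame(
  external_id      = c("HSM6XRR7", "EXAMPLE_ID"),
  tube_a_viromics  = c("SM-76C9Y", NA),
  tube_a_storage   = c("SM-00000", "SM-11111"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the names of the columns in which
# at least one cell contains `pattern` as a literal substring.
find_columns_with <- function(df, pattern) {
  hit <- vapply(
    df,
    function(col) any(grepl(pattern, as.character(col), fixed = TRUE)),
    logical(1)
  )
  names(df)[hit]
}

find_columns_with(meta, "76C9Y")
```

On the real table the call would be `find_columns_with(ibdmdb_mvx_only, "76C9Y")`, which should list every column holding that suffix rather than relying on a manual scan.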

hehouts commented 2 years ago

Looking at these "tube" columns might be important later.

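library(dplyr)
library(tidyr)

# Character "tube" columns, keeping only rows with no NA in any of them
ibdmdb %>%
  select(contains("tube")) %>%
  select(where(is.character)) %>%
  drop_na()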

These are the columns whose names contain the string "tube":

"serum_tube_number_1_received_at_csmc"
"serum_tubes_number_2_4_received_at_mgh"
"tube_a_dna_rna"
"tube_a_metabolomics"
"tube_a_storage"
"tube_a_viromics"
"tube_b_fecal_calprotectin"
"tube_b_proteomics"
"stool_sample_id_tube_a_et_oh"
"sample_id_tube_b_no_preservative"
"tube_a_and_b_received_at_broad"

hehouts commented 2 years ago

This is described in chunk s1 in the metadata summary Rmd
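
With tube_a_viromics identified as the matching column, the join itself might look roughly like the sketch below. It is written in base R and assumes the file names only need their `.tar` extension stripped. `meta` is a toy stand-in for the metadata table: only the SM-76C9Y / HSM6XRR7 pairing comes from this thread, and the second row is invented for illustration.

```r
# Virome file names as listed in this issue.
virome_files <- c("SM-76C9Y.tar", "SM-9SIJC.tar", "SM-7M8RR.tar")

# Toy stand-in for the metadata (assumption: the second row is made up).
meta <- data.frame(
  external_id      = c("HSM6XRR7", "EXAMPLE_ID"),
  tube_a_viromics  = c("SM-76C9Y", "SM-9SIJC"),
  stringsAsFactors = FALSE
)

# Build the join key by stripping the .tar extension from each file name.
files <- data.frame(
  filename         = virome_files,
  tube_a_viromics  = sub("\\.tar$", "", virome_files),
  stringsAsFactors = FALSE
)

# Left join: keep every file, with NA metadata where no row matches.
joined <- merge(files, meta, by = "tube_a_viromics", all.x = TRUE)
joined
```

Keeping unmatched files (`all.x = TRUE`) makes it easy to spot any archive that has no corresponding metadata row.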