nih-cfde / update-content-registry

Code and workflows for adding content to the content registry.
https://app-staging.nih-cfde.org/
BSD 3-Clause "New" or "Revised" License

[DNM] retrieve list of portal ids and filter input lists with it #68

Open raynamharris opened 2 years ago

raynamharris commented 2 years ago

Working on a solution for #52. I'm not sure I like this approach, but it's progress.

First, I queried the catalog to get a current list of IDs with portal pages. (This could probably be done with fewer lines of code; see the sketch after the snippet.)

import json
from urllib.request import urlopen

import pandas as pd

# retrieve list of ids with portal pages
url = "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:gene/id@sort(id)"
response = urlopen(url)
data_json = json.loads(response.read())
portal_pages = pd.json_normalize(data_json)
portal_page_ids = portal_pages["id"].to_numpy()
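
As an aside, ermrest can also return CSV directly via ?accept=csv (the same endpoint trick used by retrieve-ids.sh later in this thread), so an untested sketch of a shorter version might be:

import pandas as pd

# same query, but asking ermrest for CSV and letting pandas read the URL
url = "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:gene/id@sort(id)?accept=csv"
portal_page_ids = pd.read_csv(url)["id"].to_numpy()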

Then I created an id_list2 that filters id_list, and used that for making markdown pages.

# load up each ID in the id_list file - does it have a portal page?
# (args and sys come from earlier in the script)
id_list2 = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line not in portal_page_ids:
                print(f"ERROR: requested input id {line} not found in portal_page_ids", file=sys.stderr)
                print("skipping!", file=sys.stderr)
                continue
                #sys.exit(-1)
            id_list2.add(line)

print(f"Loaded {len(id_list2)} IDs contained in both the ref list and the portal page list.",
      file=sys.stderr)

template_name = 'alias_tables'
for cv_id in sorted(id_list2):
    ...  # make a markdown page for each cv_id (body elided)

Technically this is working, because the output looks like this:

Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 24620 reference IDs from data/validate/ensembl_genes.tsv
ERROR: requested input id ENSG00000000001 not found in ref_id_list
Loaded 19972 IDs from data/inputs/STAGING_PORTAL__available_genes__2022-08-19.txt
ERROR: requested input id ENSG00000204616 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000000001 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000262302 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000275778 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000278992 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000279846 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000281994 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000282232 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288373 not found in portal_page_ids
skipping!
ERROR: requested input id ENSG00000288708 not found in portal_page_ids
skipping!
Loaded 19962 IDs contained in both the ref list and the portal page list.

However, something like this would have to be added to every script. I wonder if there is a way to do this as a common script...

raynamharris commented 2 years ago

I moved a chunk of the code to cfde_common and made it a function:

def get_portal_page_ids(term):
    # get list of ids with portal pages from json
    url = f'https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:{term}/id@sort(id)'
    response = urlopen(url)
    data_json = json.loads(response.read())
    df = pd.json_normalize(data_json)
    ids = df["id"].to_numpy()
    print(f"Loaded {len(ids)} {term} IDs in the CFDE Portal from {url}")
    return ids

Then I added these three lines of code to the Python scripts and used id_list_filtered as the input for the make-markdown function.

    # filter by ids with a page in the portal
    id_pages = cfde_common.get_portal_page_ids(term)
    id_list_filtered = [value for value in id_list if value in id_pages]
    print(f"Using {len(id_list_filtered)} {term} IDs.")

Looks like this:

[screenshot: terminal output, 2022-09-08 2:27 PM]
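
One small note on the snippet above: get_portal_page_ids returns a numpy array, so each `value in id_pages` test is a linear scan. A possible micro-tweak (not in the original code) is to wrap the result in a set, making each membership test constant-time:

# assumption: the ids are only needed for membership tests here
id_pages = set(cfde_common.get_portal_page_ids(term))
id_list_filtered = [value for value in id_list if value in id_pages]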

raynamharris commented 2 years ago

Working now for all inputs. Especially useful for anatomy and compound.

[screenshots: terminal output, 2022-09-08 2:37 PM and 2:38 PM]

raynamharris commented 2 years ago

Okay, I have a snakemake rule that is working okay. When you run bash scripts/retrieve-ids.sh from the command line, it downloads only the missing files. However, snakemake always re-downloads all five.

rule retrieve:
    message:
        "retrieve list of ids in the registry"
    output:
        "data/validate/anatomy.csv",
        "data/validate/disease.csv",
        "data/validate/compound.csv",
        "data/validate/gene.csv",
        "data/validate/protein.csv",
    shell: """
        bash scripts/retrieve-ids.sh
    """

retrieve-ids.sh looks something like this:

if [ ! -f data/validate/anatomy.csv ]
then
    echo "Downloading csv of ids for anatomy."
    curl -L "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:anatomy/id@sort(id)?accept=csv" -o data/validate/anatomy.csv
else
    echo "csv with ids found for anatomy."
fi
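
Since the script repeats this block per term ("something like this"), a loop over the five terms would keep it DRY; a sketch, not the actual script:

# hypothetical loop version covering all five terms
for term in anatomy compound disease gene protein
do
    if [ ! -f data/validate/${term}.csv ]
    then
        echo "Downloading csv of ids for ${term}."
        curl -L "https://app.nih-cfde.org/ermrest/catalog/1/attribute/CFDE:${term}/id@sort(id)?accept=csv" -o data/validate/${term}.csv
    else
        echo "csv with ids found for ${term}."
    fi
done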

Then I added this to cfde_common.py and replaced most instances of REF_FILES with ID_FILES, to use these csv files for filtering input lists.

ID_FILES = {
    'anatomy': 'data/validate/anatomy.csv',
    'compound': 'data/validate/compound.csv',
    'disease': 'data/validate/disease.csv',
    'gene': 'data/validate/gene.csv',
    'protein': 'data/validate/protein.csv',
}

In cases where the .tsv version of the file was used as both the reference file and the alias file, I kept them as-is and commented them out of the Snakefile. Mostly KG-related things.

ctb commented 2 years ago

might be interested in: http://ivory.idyll.org/blog/2020-snakemake-hacks-collections-files.html

ctb commented 2 years ago

check out https://github.com/nih-cfde/update-content-registry/pull/71 - no need to merge, but I think it will result in fewer surprises down the road (plus it's a cute fun hack)

raynamharris commented 2 years ago

I added a little chunk of code to aggregate-markdown-pieces.py that makes a csv file with the total number of markdown chunks per widget!

    # count the .json pieces written for this widget and log the total
    jsonCounter = len(glob.glob1(dirpath, "*.json"))
    with open("logs/chunks.csv", "a") as f:
        f.write(f"{dirpath},{jsonCounter}\n")

This plot shows what I suspected, which is that there are a few instances where I have filtered my input ID list down to 0. So sad. Will work on a fix later.

[plot: markdown chunks per widget]

raynamharris commented 2 years ago

Found a script for a lincs widget without a rule, so I added that 👍

Need to find a way to filter the input lists that use the .tsv files as alias files by the .csv reference file 😞

Made a rule to make plots that you can run with make plots 😄 (a sketch of one possible plotting step follows the images below)

[plot: markdown chunks per widget]

[plot: skipped IDs per widget]
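
As a hedged illustration only: logs/chunks.csv is appended as `<dirpath>,<count>` by the counter above, so a minimal plotting script for it might look like this (column names and output path are made up here, not taken from the repo):

import pandas as pd
import matplotlib.pyplot as plt

# the log has no header row; name the two columns ourselves
df = pd.read_csv("logs/chunks.csv", names=["widget", "chunks"])
df = df.sort_values("chunks")

# one horizontal bar per widget output directory
df.plot.barh(x="widget", y="chunks", legend=False)
plt.xlabel("markdown chunks written")
plt.tight_layout()
plt.savefig("logs/chunks.png")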

ctb commented 2 years ago

suggest merging this sooner rather than later. it's getting big. and I approved the changes as of 3 days ago... and now you've made a bunch more ;)

raynamharris commented 2 years ago

It's not done yet! At least, I feel like the big picture isn't done yet. But I do see how it maybe could have been (or could be) broken up into smaller pieces.

ctb commented 2 years ago

k ;)

raynamharris commented 2 years ago

The yak I am shaving is that there are four types of Python scripts, and I can't seem to find a single solution that will let me add a filter by ID_FILES, because how a script processes an alias file affects when and how I filter by a ref file.

I currently have different solutions for each of those script types. The first is working great; the second, not so much...

raynamharris commented 2 years ago

Hence the new graphs for QC, because I like visuals. Although this probably addresses #57.

raynamharris commented 2 years ago

It's getting bigger 😆 but also better.

raynamharris commented 2 years ago

TL;DR

New common function for getting the list of portal IDs for validation:

def get_validation_ids(term):
    # get list of validation ids retrieved from portal pages
    validation_file = ID_FILES.get(term)
    if validation_file is None:
        print("ERROR: no validation file. Run `make retrieve`.", file=sys.stderr)
        sys.exit(-1)

    # load validation ids; ID is first column
    validation_ids = set()
    with open(validation_file, 'r', newline='') as fp:
        r = csv.DictReader(fp, delimiter=',')
        for row in r:
            validation_ids.add(row['id'])

    print(f"Loaded {len(validation_ids)} IDs from {validation_file}.",
          file=sys.stderr)

    return validation_ids

Validate and skip:

# validate ids
validation_ids = cfde_common.get_validation_ids(term)

skipped_list = set()
id_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in validation_ids:
                id_list.add(line)
            else:
                skipped_list.add(line)
                # log each skipped ID for the QC plots
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},ref\n")

print(f"Validated {len(id_list)} IDs from {args.id_list}.\nSkipped {len(skipped_list)} IDs not found in validation file.",
      file=sys.stderr)

Check against the alias file and skip; something like this, but it varies by script:

# validate that ID list is contained within actual IDs in database
ref_file = cfde_common.REF_FILES.get(term)
if ref_file is None:
    print("ERROR: no ref file for term. Dying terribly.", file=sys.stderr)
    sys.exit(-1)

# load in ref file; ID is first column
ref_id_list = set()
ref_id_to_name = {}
with open(ref_file, 'r', newline='') as fp:
    r = csv.DictReader(fp, delimiter='\t')
    for row in r:
        ref_id = row['id']
        ref_id_to_name[ref_id] = row['name']
        ref_id_list.add(ref_id)

print(f"Loaded {len(ref_id_list)} reference IDs from {ref_file}",
      file=sys.stderr)

# load in id list
id_list = set()
skipped_list = set()
with open(args.id_list, 'rt') as fp:
    for line in fp:
        line = line.strip()
        if line:
            if line in ref_id_list:
                id_list.add(line)
            else:
                skipped_list.add(line)
                # log each skipped ID for the QC plots
                with open("logs/skipped.csv", "a") as f:
                    f.write(f"{args.widget_name},{term},{line},alias\n")

print(f"Skipped {len(skipped_list)} IDs not found in {ref_file}.", file=sys.stderr)

Added a counter for input:

# print length of input list (counts raw lines, including blanks)
with open(args.id_list, 'r') as fp:
    x = len(fp.readlines())
print(f"Loaded {x} IDs from {args.id_list}.", file=sys.stderr)

Added a counter for output:

# summarize output
print(f"Wrote {len(id_list)} .json files to {output_dir}.",
      file=sys.stderr)

And also in scripts/aggregate-markdown-pieces.py:

    # count the .json pieces written for this widget and log the total
    jsonCounter = len(glob.glob1(dirpath, "*.json"))
    with open("logs/chunks.csv", "a") as f:
        f.write(f"{dirpath},{jsonCounter}\n")

raynamharris commented 2 years ago

Some example outputs:

Running with term: gene
Using output dir output_pieces_gene/05-MetGene for pieces.
Loaded 1274 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 1202 IDs from data/inputs/gene_IDs_for_MetGene.txt.
Skipped 72 IDs not found in validation file.
Wrote 1202 .json files to output_pieces_gene/05-MetGene.
Running with term: gene
Using output dir output_pieces_gene/00-alias for pieces.
Loaded 19971 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Loaded 19975 IDs from data/validate/gene.csv.
Validated 19962 IDs from data/inputs/gene_IDs_for_alias_tables.txt.
Skipped 9 IDs not found in validation file.
Skipped 136 IDs not found in data/inputs/Homo_sapiens.gene_info_20220304.txt_conv_wNCBI_AC.txt.
Wrote 19826 .json files to output_pieces_gene/00-alias.
Running with term: anatomy
Using output dir output_pieces_anatomy/01-embl for pieces.
Loaded 353 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Loaded 334 IDs from data/validate/anatomy.csv.
Validated 321 IDs from data/inputs/anatomy_IDs_for_embl.txt.
Skipped 32 IDs not found in validation file.
Wrote 321 .json files to output_pieces_anatomy/01-embl.
raynamharris commented 2 years ago

[plots: markdown chunks per widget; skipped IDs per widget]

raynamharris commented 2 years ago

I pushed to staging to test how things were working. I expected very few resources to refresh... The numbers are higher than expected, so I will try to figure out what is different.

2022-09-16 11:31:02,150 - INFO - Refreshed 1/366 resource_markdown values for 'anatomy' (353 in registry)
2022-09-16 11:31:11,843 - INFO - Refreshed 478/1901 resource_markdown values for 'disease' (1872 in registry)
2022-09-16 11:34:16,276 - INFO - Refreshed 13448/73500 resource_markdown values for 'compound' (59341 in registry)
2022-09-16 11:38:54,617 - INFO - Refreshed 12213/64149 resource_markdown values for 'protein' (64147 in registry)
2022-09-16 11:42:29,798 - INFO - Refreshed 0/19984 resource_markdown values for 'gene' (19971 in registry)
Resource markdown refreshed on release

raynamharris commented 2 years ago

I have a few 0s in my summary table of files created, but I think that is a math problem, because the files are being created locally, they just aren't being counted. Sigh. Will investigate.

See https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

Note: the last column is the one that is super important. This is the number of IDs that don't exist in the portal. These would normally cause make update to fail if they were left in the workflow, but I removed them so the workflow runs successfully.

ctb commented 2 years ago

let me know if you want to work through this together at all!

raynamharris commented 2 years ago

meeting set :)

raynamharris commented 2 years ago

Okay, this is my new favorite report, which tells me how many annotations were written or skipped. The ones with 0s in the written column worry me. https://github.com/nih-cfde/update-content-registry/blob/retrieve-pages/logs/README.md

So, the two relevant scripts to check for potential errors are build-markdown-pieces-gene-kg.py (makes all the kg_widgets) and build-markdown-pieces-gene-translate.py (makes the alias_table widget).

See also the new code in aggregate-markdown-pieces.py, which does the counting.