hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Capture logger output when caching a result #248

Open jburos opened 7 years ago

jburos commented 7 years ago

Often when caching results there are logger warnings, info messages and possible errors that are reported on cache generation but are lost once the result has been cached.

The majority of outputs are written to a logger - this should in theory be fairly easy to redirect to a file stored next to the cached item, in a manner analogous to PROVENANCE.

In some cases, however, these outputs are written to stdout or stderr using print statements (e.g. I've noticed this in the case of netmhc errors - as in issue #247 ). These types of errors are even easier to miss on first pass & are also not preserved for cached results. It would seem ideal to instead capture these messages & redirect them to a logger -> which can then be captured along with other logger output. I see these two items as somewhat separate enhancements, since they do not depend on one another although they are complementary.

E.g. borrowing from issue #247:

Output on the first run (empty cache-dir):

In [11]: df = cohort.as_dataframe(on=[expressed_neoantigen_count])

expressed_neoantigen_count:  65%|█████████████████████████████████████████████▋                        | 30/46 [00:06<00:03,  5.29it/s]Traceback (most recent call last):
  File "/demeter/users/burosj01/projects/netmhc-bundle/netMHC-3.4/netMHC.py", line 536, in <module>
    main()
  File "/demeter/users/burosj01/projects/netmhc-bundle/netMHC-3.4/netMHC.py", line 528, in main
Traceback (most recent call last):
  File "/demeter/users/burosj01/projects/netmhc-bundle/netMHC-3.4/netMHC.py", line 536, in <module>
    fulldict[allele] = printresults(NNpred, seqs, inorder, preds, allele, methodlist[allele], peplen, wwwrun)
KeyError: 'HLA-A24:03'
    main()
  File "/demeter/users/burosj01/projects/netmhc-bundle/netMHC-3.4/netMHC.py", line 528, in main
    fulldict[allele] = printresults(NNpred, seqs, inorder, preds, allele, methodlist[allele], peplen, wwwrun)
KeyError: 'HLA-A24:03'
# ... [truncated] ...
expressed_neoantigen_count:  70%|████████████████████████████████████████████████▋                     | 32/46 [00:30<01:39,  7.11s/it]
expressed_neoantigen_count:  74%|███████████████████████████████████████████████████▋                  | 34/46 [01:19<03:24, 17.05s/it]
expressed_neoantigen_count: 100%|██████████████████████████████████████████████████████████████████████| 46/46 [03:07<00:00,  9.42s/it]

Output on the second run (after results have been cached):

In [16]: df = cohort.as_dataframe(on=[expressed_neoantigen_count])

expressed_neoantigen_count: 100%|██████████████████████████████████████████████████████████████████████| 46/46 [00:08<00:00,  8.74it/s]

I think what we would want to have is:

  1. At time of cache creation:

    • Capture relevant output from the logger instance
    • Report on current logger output, according to current logging level
  2. At time of retrieval from cache:

    • Report on captured logger output, according to "current" logging level

One consideration I have is that the user-configured logging LEVEL may be different at the time of capture (step1) than it is at the time of reporting (step2).

A second issue is how to separate the cached-log-events from the current-log-events. Reporting on both as if they occured simultaneously might be confusing.