Open joeflack4 opened 2 years ago
@DaveraGabriel FYI @stephanieshong I don't remember who else might be working on this, but feel free to link them to this / or "add to assignees".
We will assign this task to Rohan Hurer.
get_ipython().system('pip3 install --user nltk flashtext')
import nltk
nltk.download('punkt')
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_dict = {
"codesystem": ["DICOM","SNOMED", "LOINC", "ICD10CM", "ICD10PCS", "NDC", "RxNorm" ],
"HL7Productfamilies": ["CDA", "C-CDA", "V3", "Version3"],
"TerminologyResources": ["ConceptMap", "CodeSystem","ValueSet","Terminology Service","TerminologyCapabilities", "NamingSystem", "Coding", "Code", "CodeableConcept"],
"Operations": ["$lookup", "$validate-code", "$subsumes", "$find-matches", "$expand", "$translate", "$closure"]
}
# Register every keyword under its category name before querying.
keyword_processor.add_keywords_from_dict(keyword_dict)
keyword_processor.extract_keywords('zulip activities based on code system, HL7Product family, Terminology Resources and Operations')
Some options we discussed:
a. Fetch the message text of each stream topic and query the topics separately, then aggregate the results.
b. Concatenate the text of all topics into one big string, and then query that.
My instincts lean me towards (a) for some reason, but I think both are potentially good.
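To make the two options concrete, here is a minimal sketch that runs both on the same input. It uses a plain regex count from the standard library as a stand-in for flashtext, and `CATEGORIES`, `count_categories`, `option_a` and `option_b` are hypothetical names with a shortened category map, not part of the project.

```python
import re
from collections import Counter

# Hypothetical, shortened category map; the real one is keyword_dict above.
CATEGORIES = {
    "codesystem": ["SNOMED", "LOINC", "RxNorm"],
    "Operations": ["$expand", "$translate"],
}

def count_categories(text, categories=CATEGORIES):
    """Count case-insensitive keyword hits per category in one string."""
    counts = Counter()
    for category, keywords in categories.items():
        for kw in keywords:
            # Lookarounds approximate whole-word matching, also for "$expand".
            pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
            counts[category] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

def option_a(topic_texts):
    """Option (a): query each topic separately, then aggregate."""
    per_topic = {topic: count_categories(text) for topic, text in topic_texts.items()}
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)
    return per_topic, total

def option_b(topic_texts):
    """Option (b): concatenate all topic text, then query once."""
    return count_categories("\n".join(topic_texts.values()))
```

Both options should give the same totals. The practical difference is that (a) also keeps a per-topic breakdown, which the thread-level tasks under 6a-6d would probably need anyway.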
(Originally taken from: Requirements google doc) Zulip Terminology Stream Text Mining Project
The base Zulip bulletin board application is backed by a REST API that can be queried via Python scripts. Bots can also be configured via Python to provide real-time monitoring. This project is text mining of the terminology stream in the FHIR Zulip community bulletin board, to discover trends in the use of terminologies and terminology services within the HL7 FHIR community.
The objective of this exercise is to review the history of the content and activity of the Terminology stream.
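As a rough illustration of querying the REST API from Python, the sketch below pages through a stream's history with the `get_messages` call of the official `zulip` client package. The function name `fetch_stream_messages`, the batch size, and the stream name in the usage comment are assumptions; the client is passed in so the paging logic can be tried without a server.

```python
def fetch_stream_messages(client, stream, batch_size=1000):
    """Page forward through a stream's history, oldest message first."""
    messages = []
    anchor = "oldest"
    while True:
        response = client.get_messages({
            "anchor": anchor,
            "num_before": 0,
            "num_after": batch_size,
            "narrow": [{"operator": "stream", "operand": stream}],
            # Ask for raw message text rather than rendered HTML.
            "apply_markdown": False,
        })
        if response.get("result") != "success":
            raise RuntimeError(response.get("msg", "Zulip request failed"))
        batch = response["messages"]
        messages.extend(batch)
        # Stop once the server reports that the newest message was reached.
        if response.get("found_newest", True) or not batch:
            return messages
        # Continue just past the last message already fetched.
        anchor = batch[-1]["id"] + 1

# Usage (assumes the zulip package and a zuliprc file with API credentials):
# import zulip
# client = zulip.Client(config_file="~/.zuliprc")
# messages = fetch_stream_messages(client, "terminology")
```

Each returned message is a dict; the later sketches rely only on its `subject` (topic), `timestamp` and `content` fields.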
Task list
Task details
(Refer to the Requirements google doc for more info, especially for 1-5.)
6a. Thread length
6a.i. Average length of threads: Determine the average length (in days / weeks / months) of threads in the terminology stream.
6a.ii. Identify outlier threads in terms of length: Identify outliers in length, i.e. longer-running threads.
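One possible way to compute thread length in calendar time is sketched below. It assumes each message is a dict with a `subject` (topic) and a Unix `timestamp`, as in Zulip's message objects, and it treats a month as 30.44 days, which is only an averaging convention. The function names are hypothetical.

```python
from statistics import mean

SECONDS_PER_DAY = 86400
DAYS_PER_MONTH = 30.44  # assumption: average month length

def thread_lengths_in_days(messages):
    """Map each topic to the days between its first and last message."""
    first, last = {}, {}
    for m in messages:
        topic, ts = m["subject"], m["timestamp"]
        first[topic] = min(first.get(topic, ts), ts)
        last[topic] = max(last.get(topic, ts), ts)
    return {t: (last[t] - first[t]) / SECONDS_PER_DAY for t in first}

def average_length(lengths_in_days):
    """Average thread length expressed in days, weeks and months."""
    days = mean(lengths_in_days.values())
    return {"days": days, "weeks": days / 7, "months": days / DAYS_PER_MONTH}
```

A single-message topic gets a length of 0 days under this definition; whether such topics should count towards the average is a decision left open here.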
Possible solutions: For this, we can aggregate all thread lengths (i.e. in terms of number of messages) and report 2 different classes of outliers: (i) 1 standard deviation away from the mean, and (ii) 2 standard deviations away.
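The two outlier classes could be computed roughly as follows. This sketch assumes only the upper tail matters (longer threads), uses the population standard deviation, and takes a mapping from topic to message count; `classify_outliers` and the class labels are hypothetical names.

```python
from statistics import mean, pstdev

def classify_outliers(message_counts):
    """Split topics into those more than 1 and more than 2 SDs above the mean."""
    values = list(message_counts.values())
    mu, sigma = mean(values), pstdev(values)
    classes = {"over_1_sd": [], "over_2_sd": []}
    if sigma == 0:
        return classes  # all threads have the same length, so no outliers
    for topic, n in message_counts.items():
        z = (n - mu) / sigma
        if z > 2:
            classes["over_2_sd"].append(topic)
        elif z > 1:
            classes["over_1_sd"].append(topic)
    return classes
```

Thread lengths are often heavily skewed, so the mean and standard deviation may themselves be pulled up by a few very long threads; a percentile cut-off is one alternative worth comparing.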
6b. Threads lacking adequate resolution
Identify those topics with (i) many responses (not necessarily with longer length, but will likely be one of these as well) that (ii) do not have some sort of resolution. Will require iterative review with SME (Davera or others)
Possible solutions: (i) Many responses: Can potentially be defined as 1 standard deviation away from mean. (ii) Lacking resolution: This would likely be too time consuming to automate; so should go with suggestion of SME review. However, we could programmatically automate / aid this analysis, perhaps, by re-reading the analytical output. The output (likely a CSV file) could have 1+ codified curator columns, where data will be manually entered by SMEs. Then, that information could be re-read if further programmatic analysis is needed.
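The round trip described above (write the analytical output with empty curator columns, let SMEs fill them in, then re-read) might look like this with the standard `csv` module. The column names `resolved` and `curator_notes`, and the convention that an SME enters "no" for an unresolved thread, are assumptions for illustration.

```python
import csv
import io

# Hypothetical column layout; the last two columns are filled in by SMEs.
FIELDS = ["topic", "message_count", "resolved", "curator_notes"]

def write_review_sheet(rows, fh):
    """Write analysis rows, leaving the curator columns blank."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({"resolved": "", "curator_notes": "", **row})

def read_unresolved(fh):
    """Return topics that an SME marked as not resolved."""
    return [
        r["topic"]
        for r in csv.DictReader(fh)
        if r["resolved"].strip().lower() == "no"
    ]

# Small in-memory demonstration; a real run would use open(..., newline="").
sheet = io.StringIO()
write_review_sheet([{"topic": "ValueSet expand", "message_count": 42}], sheet)
```

Keeping the curator values to a small fixed vocabulary (for example yes / no / unclear) would make the re-read step more reliable than free text.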
6c. Frequency variance
For each of the count categories (1-4) above, determine when these topics occur: when are they more frequent / less frequent?
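A minimal sketch of frequency over time, assuming each message has already been reduced to a Unix timestamp plus a list of category labels (however those labels were extracted). Bucketing by calendar month in UTC is an assumption; `month_key` and `category_counts_by_month` are hypothetical names.

```python
from collections import Counter
from datetime import datetime, timezone

def month_key(timestamp):
    """Format a Unix timestamp as a 'YYYY-MM' bucket in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")

def category_counts_by_month(tagged_messages):
    """tagged_messages: iterable of (timestamp, [category, ...]) pairs."""
    counts = Counter()
    for ts, categories in tagged_messages:
        for category in categories:
            counts[(month_key(ts), category)] += 1
    return counts
```

The resulting `(month, category)` counts can be pivoted into a table with one row per month and one column per category to see when each category peaks.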
6d. Activity variance
Date-based counts for all topics indicating activity levels: when is the stream more active / less active?
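Overall activity could be summarised with a per-month message count, sketched below under the assumption that each message is a dict with a Unix `timestamp`. Monthly granularity and the function names are illustrative choices only.

```python
from collections import Counter
from datetime import datetime, timezone

def messages_per_month(messages):
    """Count messages per 'YYYY-MM' bucket (UTC), in chronological order."""
    counts = Counter(
        datetime.fromtimestamp(m["timestamp"], tz=timezone.utc).strftime("%Y-%m")
        for m in messages
    )
    return dict(sorted(counts.items()))

def busiest_and_quietest(per_month):
    """Return the months with the highest and lowest message counts."""
    busiest = max(per_month, key=per_month.get)
    quietest = min(per_month, key=per_month.get)
    return busiest, quietest
```

Months with no messages at all do not appear in this output, so a truly silent period would have to be filled in separately before looking for the least active month.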
Additional info