jhu-bids / fhir-zulip-nlp-analysis

Ad hoc NLP (Natural Language Processing) analysis of HL7 FHIR's online Zulip chat streams.

Requirements #1

joeflack4 opened this issue 2 years ago

joeflack4 commented 2 years ago

Description

(Originally taken from: Requirements google doc) Zulip Terminology Stream Text Mining Project

The base Zulip bulletin board application is supported by a REST API that can be queried via Python scripts. Bots can also be configured via Python to provide real-time monitoring. The goal is to text-mine the terminology stream of the FHIR Zulip community bulletin board to discover trends in the use of terminologies and terminology services within the HL7 FHIR community.

The objective of this exercise is to review the history of the content and activity of the terminology stream.
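As a rough sketch, the stream history could be pulled via the REST API using the official zulip Python client (a sketch only; it assumes a ~/.zuliprc with credentials for chat.fhir.org, and the stream name "terminology" is an assumption):

import zulip

# Assumes a ~/.zuliprc file with credentials for chat.fhir.org.
client = zulip.Client(config_file="~/.zuliprc")

# Pull the most recent messages from the "terminology" stream; for the full
# history, paginate by passing the oldest message id as the next anchor.
result = client.get_messages({
    "anchor": "newest",
    "num_before": 1000,
    "num_after": 0,
    "narrow": [{"operator": "stream", "operand": "terminology"}],
})
messages = result["messages"]  # each message dict includes its topic, content, and timestamp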

Task list

Task details

(Refer to the Requirements google doc for more info, especially for tasks 1-5.)

6a. Thread length

6a.i. Average length of threads: Determine the average length (in days / weeks / months) of terminology stream threads.

6a.ii. Identify outlier threads in terms of length: Identify outliers in length, i.e. longer-running threads.

Possible solutions: Aggregate all thread lengths (i.e. in terms of number of messages) and report two different classes of outliers: (i) threads 1 standard deviation away from the mean, and (ii) threads 2 standard deviations away.
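A minimal sketch of that outlier flagging, assuming the per-thread message counts have already been collected (the thread names and counts below are placeholders):

from statistics import mean, stdev

# Placeholder data: number of messages per thread (topic).
thread_lengths = {"topic A": 4, "topic B": 52, "topic C": 9, "topic D": 120}

mu = mean(thread_lengths.values())
sigma = stdev(thread_lengths.values())

# Class (i): more than 1 standard deviation above the mean.
outliers_1sd = {t: n for t, n in thread_lengths.items() if n > mu + sigma}
# Class (ii): more than 2 standard deviations above the mean.
outliers_2sd = {t: n for t, n in thread_lengths.items() if n > mu + 2 * sigma}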

6b. Threads lacking adequate resolution

Identify those topics that (i) have many responses (not necessarily longer in duration, though they will likely be long as well) and (ii) do not have some sort of resolution. This will require iterative review with an SME (Davera or others).

Possible solutions: (i) Many responses: could be defined as 1 standard deviation above the mean. (ii) Lacking resolution: this would likely be too time-consuming to automate, so we should go with the suggestion of SME review. However, we could programmatically aid this analysis by re-reading the analytical output. The output (likely a CSV file) could have one or more codified curator columns, where data would be entered manually by SMEs. That information could then be re-read if further programmatic analysis is needed.
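One possible shape for that curator workflow, sketched with pandas (the column names are assumptions, not a decided format):

import pandas as pd

# Analysis output: one row per topic, with empty columns for SMEs to fill in manually.
df = pd.DataFrame({
    "topic": ["topic A", "topic B"],
    "n_messages": [52, 120],
    "curator_resolved": ["", ""],  # SME enters e.g. "yes" / "no"
    "curator_notes": ["", ""],
})
df.to_csv("thread_resolution_review.csv", index=False)

# Later, after SMEs have filled in the curator columns, re-read for further analysis.
reviewed = pd.read_csv("thread_resolution_review.csv")
unresolved = reviewed[reviewed["curator_resolved"].astype(str).str.lower() == "no"]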

6c. Frequency variance

For each of the count categories (1-4) above, determine when these topics occur: when are they more frequent and when are they less frequent?
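A rough sketch of how per-category frequency over time might be computed with pandas, assuming each keyword hit has already been reduced to a (timestamp, category) pair (the rows below are placeholders):

import pandas as pd

# Placeholder input: one row per keyword hit (message timestamp, matched category).
hits = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-06-02"]),
    "category": ["codesystem", "Operations", "codesystem"],
})

# Count hits per category per month to see when each category is more / less frequent.
monthly_by_category = (
    hits.set_index("timestamp")
        .groupby("category")
        .resample("M")
        .size()
)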

6d. Activity variance

Date-based counts for all topics indicating activity levels: when is the stream more active / less active?
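A similarly hedged sketch for overall activity, using only the message timestamps (Zulip returns these as Unix epoch seconds; the values below are placeholders):

import pandas as pd

# Placeholder messages; in practice these come from the Zulip API.
messages = [{"timestamp": 1609459200}, {"timestamp": 1612137600}]

ts = pd.to_datetime([m["timestamp"] for m in messages], unit="s")
messages_per_month = pd.Series(1, index=ts).resample("M").sum()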

Additional info

Links

  1. Requirements google doc
  2. Chat URL: http://chat.fhir.org
  3. Zulip API docs: https://zulip.com/api/rest
  4. Category keywords google sheet
joeflack4 commented 2 years ago

@DaveraGabriel FYI @stephanieshong I don't remember who else might be working on this, but feel free to link them to this / or "add to assignees".

stephanieshong commented 2 years ago

We will assign this task to Rohan Hurer.

stephanieshong commented 2 years ago

Example of an NLP keyword search that might be useful:

# In a notebook: !pip3 install --user nltk flashtext
import nltk
from flashtext import KeywordProcessor

nltk.download('punkt')  # tokenizer models; not used by flashtext itself, but useful for further NLP

# Map each category name to the keywords that count toward it.
keyword_processor = KeywordProcessor()
keyword_dict = {
    "codesystem": ["DICOM", "SNOMED", "LOINC", "ICD10CM", "ICD10PCS", "NDC", "RxNorm"],
    "HL7Productfamilies": ["CDA", "C-CDA", "V3", "Version3"],
    "TerminologyResources": ["ConceptMap", "CodeSystem", "ValueSet", "Terminology Service",
                             "TerminologyCapabilities", "NamingSystem", "Coding", "Code",
                             "CodeableConcept"],
    "Operations": ["$lookup", "$validate-code", "$subsumes", "$find-matches", "$expand",
                   "$translate", "$closure"],
}
keyword_processor.add_keywords_from_dict(keyword_dict)

# Returns the category name (dict key) for each keyword found in the text.
keyword_processor.extract_keywords('zulip activities based on code system, HL7Product family, Terminology Resources and Operations')
joeflack4 commented 2 years ago

Some options we discussed:

a. Fetch stream topic message text strings and query them separately, then aggregate the results.
b. Concatenate the text of all topics together into one big string, then query that.

My instincts lean towards (a) for some reason, but I think both are potentially good.
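A minimal sketch of option (a), assuming a flashtext processor like the one in the earlier comment and a list of message dicts with a 'content' field (the messages below are made up):

from collections import Counter
from flashtext import KeywordProcessor

# Abbreviated version of the category keyword dict from the earlier comment.
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_dict({
    "codesystem": ["LOINC", "SNOMED"],
    "TerminologyResources": ["ValueSet", "CodeSystem"],
    "Operations": ["$expand", "$lookup"],
})

# Placeholder messages; in practice these come from the Zulip API.
messages = [
    {"content": "Does anyone have a ValueSet for LOINC panels?"},
    {"content": "Try the $expand operation on the terminology server."},
]

# Option (a): query each message's text separately, then aggregate the results.
category_counts = Counter()
for msg in messages:
    category_counts.update(keyword_processor.extract_keywords(msg["content"]))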