Closed ChristosLynn closed 1 year ago
Setting this as an initial issue template to track any pre-release changes in the README or the functionality.
Adding TODO: a FileChooserListView wrapper for folder selection. It must show only a view of folders. FileChooserListView usage throughout the application must also be rewritten to restrict to filters (e.g. .png, .csv). A tree view of folders as a completely new widget may be more appropriate, as ordinary users are not used to selecting folders from what appears to be a file selection dialog. The initial folder must also be chosen from the root of the tree beneath the application folder, to avoid an invalid-folder error if the folder does not exist. Thus a wrapper class is required.
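As a sketch of the restriction logic (the helper names here are hypothetical, not existing project code): Kivy's file choosers accept callables in their `filters` list, so the folder-only view and the extension restriction can both be expressed as plain filter functions that a wrapper class would pass through.

```python
import os

def folders_only(folder, filename):
    # Show only directories, hiding ordinary files from the chooser view.
    return os.path.isdir(os.path.join(folder, filename))

def make_extension_filter(*extensions):
    # Build a filter callable restricting the view to the given extensions,
    # e.g. make_extension_filter(".png", ".csv"), case-insensitively.
    exts = tuple(e.lower() for e in extensions)
    def _filter(folder, filename):
        return filename.lower().endswith(exts)
    return _filter

# Example filter an image/CSV chooser might use.
image_csv_filter = make_extension_filter(".png", ".csv")
```

The wrapper would then construct something like `FileChooserListView(filters=[folders_only], dirselect=True)` for folder selection, falling back to a known-good initial `path` beneath the application folder when the requested one does not exist.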
Adding this for posterity: https://tdgq.com.au/structured-editing/why-dont-all-rtf-parsers-recognise-styles/
It seems my suspicions were correct: many parsers do NOT support stylesheets in RTF.
This is apparently why JetBrains' RTF styling on the clipboard is so weird: it is to ensure the styling works in as many parsers as possible. This means not just setting \cbN but also \cbpatN (Microsoft Word ignores \cbN; see https://www.oreilly.com/library/view/rtf-pocket-guide/9781449302047/ch01.html, "in one of the few cases I've yet found of Microsoft completely disregarding their own RTF specification...").
I note the book talks about \chshdng0\chcbpatColorNum, and yet the equivalent \cbpat command seems to do the same trick in the instances I have seen so far.
So I'll rewrite sections of our RTF writer to add both the style reference and the character formatting, because, stuff it, just because parsers won't parse properly doesn't mean we shouldn't be syntactically accurate.
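A minimal sketch of the belt-and-braces emission (function names hypothetical, not the actual writer): each run carries both the spec-correct \cbN and the \chshdng0\chcbpatN fallback the Pocket Guide describes, so the background colour survives in Word as well as in spec-faithful parsers.

```python
def highlighted_run(text, color_idx):
    # Emit both the spec-correct \cb control word and the \chshdng0\chcbpat
    # fallback that Word honours, so the highlight works in as many
    # parsers as possible.
    return r"\cb%d\chshdng0\chcbpat%d %s" % (color_idx, color_idx, text)

def minimal_doc(runs):
    # Wrap runs in a minimal RTF shell with a two-colour table
    # (index 1 = black text, index 2 = yellow background).
    header = (r"{\rtf1\ansi{\colortbl;"
              r"\red0\green0\blue0;"
              r"\red255\green255\blue0;}")
    return header + "".join(runs) + "}"

doc = minimal_doc([highlighted_run("hello", 2)])
```

The real writer would of course also emit the stylesheet group; the point is only that both character-level forms appear together in each styled run.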
Additional notes for discussion ... remaining pre public release checks/fixes.
I am considering a minor redesign of the query functionality which may give some improvements.
We have already seen how a slight redesign of the base classes used to lay things out changes things: annotations of the textbox above and below (preferably in the padding region) allowed numeric queries to display a preview of the result set size dynamically. What I am considering is chaining together different instances of these dynamic query form fields (text or range so far, but eventually also the checkbox/trinary group as we have now).
The UX concept: upon an action that changes the size of the resultset, the program would check whether any prior inputs exist in the widget tree that have filters applied, and create an annotation in the form of a transparent arrow from one control to another, labelled with either the word 'and' or 'or', which can be toggled between. Similarly, the widget would notify any widgets downstream of it in the widget tree that are set whenever its resultset changes, while any notification function receiving such a notification should break the link with any existing left or right branch, i.e. the tree must remain consistent.
These considerations lead to a design in which the graphical progression of the reduction of the dataset is gradually arrived at as the filter expression is constructed through the chaining of these objects. It then allows us to remove the logic for determining the ordering of filter application as currently designed, and to focus on UDQ (user-defined queries).
The second thing would be solving the problem of the seemingly missing data in the columns, which I won't go into here. What I will say is that it entails writing a secondary aspect of the query screen to allow the auto-complete database access goodness we get in JupyterLab, but without the disadvantage of having to maintain a messy notebook of lines of code. I would consider the initial query functionality complete at that point.
I need to finish the time series data graphs and maybe a reordering of the buttons is in order.
Additional documentation of my thinking process here... I've already used the on_df function to implement changes that occur when the program has finished loading and checking the dataframes, i.e. bindings for the on_text functions are bound once this sanity check is cleared. The same concept can be used here, although all widgets must implement left and right branching notification functions: left to say "I'm connecting with you, notify me of updates to your dataset", right to say "I'm connecting with you, here is your new dataset to reduce or expand as you wish". If this is done via a FunctionProperty or some such equivalent in Kivy, and a logical lock is implemented to prevent an infinite loop of notifications, we have a bidirectional pipe whereby, setting the origin as the caller's self property, the notification function is only called if the caller's address differs from the already existing code reference. Then, from the self set, the implemented "resultset has changed" function can be called to allow the newly connected object to perform any logic it needs, like updating the arrow connecting the two of them. Typically, elements earlier than the caller would ignore notifications of changes in the result set, while elements downstream of the caller accept the notification and update their own display based on their existing state.
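To check the scheme is logically sound, here is a minimal Kivy-free sketch (class and method names hypothetical) of the branching links plus the lock that prevents a notification loop: relinking breaks any existing branch so the tree stays consistent, and a re-entrancy flag stops echoes.

```python
class DynamicQueryField:
    # Sketch of the left/right branching notification scheme described
    # above; in the real app these would be Kivy widgets/properties.
    def __init__(self, name):
        self.name = name
        self.upstream = None      # left branch: who notifies me
        self.downstream = None    # right branch: who I notify
        self._notifying = False   # logical lock against infinite loops
        self.resultset = set()

    def link_downstream(self, other):
        # Breaking any existing branch keeps the chain (tree) consistent.
        if self.downstream is not None:
            self.downstream.upstream = None
        self.downstream = other
        other.upstream = self

    def set_resultset(self, rs, origin=None):
        # The lock ensures a change never echoes back to its origin.
        if self._notifying:
            return
        self._notifying = True
        try:
            self.resultset = rs
            # Only downstream elements accept the notification.
            if self.downstream is not None and self.downstream is not origin:
                self.downstream.set_resultset(rs, origin=self)
        finally:
            self._notifying = False

# Tiny demonstration: link two fields and push a resultset downstream.
a = DynamicQueryField("deaths")
b = DynamicQueryField("er_visits")
a.link_downstream(b)
a.set_resultset({1, 2, 3})
```

The upstream/downstream asymmetry mirrors the note above: the earlier element ignores changes from below, while downstream elements update their display from their existing state.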
The expression generator would similarly call any immediate upstream cousin to get its traitlet, which in turn returns the chained traitlets before it in the widget tree that are set. So this might build the query from ER visits and the double traitlet, bracket it, and logically and/or it (per its toggle value) with the deaths traitlet before it. Thus, calling the expression generator function at a given point in the widget tree returns the final query at that point in the form, and the query function simply executes the query returned by the last object in the chain (list).
This seems logically complete and capable of generating reasonable query expressions rather than sub-bracketed madness.
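The chaining above can be sketched as follows (names and query fragments are illustrative only): each node brackets the accumulated upstream expression before joining its own fragment with its toggled 'and'/'or', which is what keeps the result flat rather than sub-bracketed madness.

```python
class ChainedFilter:
    # Hypothetical sketch of the expression-generator chain: each node
    # holds its own query fragment plus a toggleable 'and'/'or' joining
    # it to the node before it in the widget tree.
    def __init__(self, fragment, upstream=None, joiner="and"):
        self.fragment = fragment
        self.upstream = upstream
        self.joiner = joiner

    def expression(self):
        # Bracket everything upstream before joining, so each new node
        # adds exactly one level of brackets around the prior expression.
        if self.upstream is None:
            return self.fragment
        return "(%s) %s %s" % (self.upstream.expression(),
                               self.joiner, self.fragment)

# Build "deaths" then chain "ER visits" onto it with an 'or' toggle.
deaths = ChainedFilter("DIED == 'Y'")
er = ChainedFilter("ER_VISIT == 'Y'", upstream=deaths, joiner="or")
expr = er.expression()
```

Calling `expression()` on the last object in the chain yields the full query at that point in the form, ready to hand to the query function.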
Eventual direction of the program.
I would like to abstract the labels on the query page and recordset viewer, as well as the layout (for print and screen) away from the main code. Ditto for the database fields and database location.
The idea would be to move away from reliance of the program on a single source of data -- in this case VAERS -- towards being able to abstract the program to being able to query other datasets and display MPL insights for these other datasets. Moreover, using the SERB code to speed up joining tables/taking records, as well as abstracting design patterns for creating the MPL graphs, maybe even allowing pluggable python modules to add additional functionality, the program can eventually be transformed into one that can handle any type of health dataset (with the intention to focus primarily on health related datasets, but extending beyond that to middle of the road big data in general).
This would then move the program from the niche market of researchers of one dataset and into the realm of data analytics in general. It would make data analytics more accessible to the armchair user with basic database/Excel knowledge for instance.
UK Yellow Card - SUMMARY data https://coronavirus-yellowcard.mhra.gov.uk/datasummary
The expression-generator design described above has now been implemented in basic form within the FluxCapacitor class, and can now be extended to all other dynamic query fields.
The comment last night might have been premature, but the FluxCapacitor class is now fully functional, and the capacitor bank rotates to let you choose any logical or binary query you can dream up. I'll extend it to the rest of the less useful fields like RECOVD and clean up the graphics in the next couple of days.
Ideal screen size. At this size the flux and first few fields are easily visible, and the feel of page down is ideal where the content size is fixed.
Adding content for the graph I've now titled "Data insights by % filled"
This gives you a whole range of information at a glance if you know what you are looking at.
For example, this is a set of 547 records. We can see HOSPITAL and DISABLE are both fully filled out. This is because we asked in the query for these two fields to be Y.
Actually, we asked for hospitalisations or deaths where there was disablement involved. They had to be life threatening, so that field is 100% filled too. Had we not cared about these fields, the number of NA values (in the case of VAERS) means the percentage where the field was effectively no or irrelevant. In other words, since you will usually have at least one field entirely unfilled, at a glance the length of the bar effectively gives you the percentage hospitalised vs non-hospitalised, and so on for the other 'boolean' fields.
You can see in the above sample set that not a lot of people died that were also disabled where long reports were written up.
Other fields tell you about the data quality for the sample you selected. For example, 2/3 of reporters didn't fill in the report date, or if they did, it was not recorded. This could lead to other questions, like what is the source of the non-reporting: the patient/their family, the doctors, finding a suitable person to file the report, keying issues... etc.
A good 35% provide at least something in the lab notes/data which is a positive sign in that potentially there may be things in that lab data that provide further data insights, either by human or machine review.
The number of birth defects in this dataset is surprisingly high. One might miss that by looking at the "At a glance:" graph which shows big bars for hospitalisations, life threatening, and disabled. In this graph here, we can see that actually BIRTH_DEFECT is only NA around 420 or so times out of the 547, leaving over 120 where they at least indicated something in this field. Did they indicate lower case Y or N explicitly? We don't know without looking closer. One should always be careful coming to conclusions before looking closer at the data.
You can also see almost 50% were on other medications at the time of the adverse event. However, one should also entertain the possibility that a large number of those could be mere doctor annotations of "none mentioned", some trivial phrase meaning "don't know", or even "no". Thus the graphs and the data viewer itself should be used side by side when forming any conclusion, to ensure the nature of the data is understood.
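For reference, the quantity each bar in this graph encodes is just the per-column percentage of non-NA values. A sketch with a made-up stand-in frame (not real VAERS data):

```python
import pandas as pd

# Hypothetical sample standing in for a VAERS query result set.
df = pd.DataFrame({
    "HOSPITAL": ["Y", "Y", "Y", "Y"],
    "LAB_DATA": ["elevated troponin", None, None, "clear"],
    "RECVDATE": [None, None, None, "2021-06-01"],
})

# Percent of non-NA values per column: the "% filled" each bar shows.
pct_filled = df.notna().mean().mul(100).round(1)
```

Here HOSPITAL comes out at 100% (we queried on it), while the sparser fields fall away, which is exactly the at-a-glance reading described above.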
I'll review for more to add tomorrow.
It'll be worth going into how viewing this in conjunction with graphs 7 and 8 helps in grasping the structure of the sample you are looking at at any given time. In this example, you can see a lot of reports are foreign. That's to be expected, as this is not a batch-specific query, and you can see that after unknown, there were only 10 batches with 2 incidents, while the remaining 521 are distributed among different lot numbers. It is noteworthy that over 50% of samples in this dataset reported no lot number, which tells you our data quality is not great in these cases (or maybe a large percentage are vaccines without lot numbers?). Despite this normalisation in the batches, state was filled in consistently in almost all cases, and consequently we can tell that Florida was disproportionately affected by this type of reaction (to COVID only? Or to all vaccines in general?
This is where you would narrow by state to find out more about what kinds of stories the data is telling). Just under 40 reports came from one state of 52, so it's a 40% jump from the next one down. This is where the "narrow x limits" functionality on the graph screen becomes handy. You can remove the smaller states one by one to see visually that the Florida bar remains consistently taller (removing foreign from the right side and resetting y to see the full extent of the difference).
Another clue in this dataset to what we're dealing with is the age at vaccination. Almost 60 of 547 are infants. We did not select a vaccine type in our query, hence the birth defect information in the first graph makes sense. In further exploring this data set we might remove that bar by narrowing the range and then recalculating (maybe an auto-recalculate mode would be useful?). You would then see the three spikes at around 70, 15-20, and 60 years of age. And so on.
I think this is probably a useful query to demonstrate in the manual, as it shows the use of the and/or functionality, and there's a diversity of data problems and non-issues in graph 1 to explore when going over why this one is so helpful and why it pays to use the other graphs with the first in order to get a clear picture of what is going on.
Content for documentation/manual.
We should cover in the batch numbers the very real problem of keying errors in the data. EK5730 is a great example of that.
Here are some of the variants for EK5730 (note some of these are not returned by simply running a regular expression on the correct batch)
The recommendation to give users of the program: if there is a particular batch number they are interested in, try searching a partial identifier, or several partial identifiers, using the regex search functionality, which is there for that reason. Users can separate multiple substrings with or (the pipe symbol), for instance "EK|5730", and then adjust their query based on the results that are clear matches for the batch.
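A sketch of why the pipe approach catches keying errors that an exact match misses (the lot strings below are invented examples, not real VAERS values):

```python
import re

# Hypothetical variants of one physical batch, including keying errors
# that an exact match on "EK5730" would miss entirely.
lots = ["EK5730", "EK 5730", "ek5730-1", "E5730", "EL5730", "PFIZER123"]

# "EK|5730" matches anything containing either partial identifier,
# casting a wide net; the analyst then prunes the obvious non-matches.
pattern = re.compile("EK|5730", re.IGNORECASE)
candidates = [lot for lot in lots if pattern.search(lot)]
```

Exact matching would return only the first entry; the partial search surfaces the dropped-letter and extra-character variants, after which the user narrows the query to the confirmed matches.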
Regarding narrowing the dataset.
I think it is advisable to leave the narrowing keys as they are to prevent accidental narrowing of the dataset, and keep the control + arrow keys for something less problematic, such as the horizontal equivalent of page down. With a slider or text box option for narrowing without hotkeys, this should not be an inconvenience.
As the image shows, the problem of the symptoms going beyond the header size has (hopefully!) been fixed for now. This was the largest symptom set I've found yet (69 of them!). The patient header is resized to the texture height * 1.1 or 100, whichever is larger.
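The resize rule above amounts to a one-liner (function name hypothetical); writing it out mainly documents why short symptom lists still get a usable minimum header.

```python
def patient_header_height(texture_height, minimum=100):
    # Resize rule: texture height * 1.1 (a little breathing room) or the
    # minimum, whichever is larger, so long symptom sets never overflow
    # and short ones never collapse the header.
    return max(texture_height * 1.1, minimum)
```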
It seems the final thing that has been causing grief (the package size) is due to a number of factors, but conda specifically may be part of the problem. I'm sure the more problematic libraries, which include optional backends to other libraries (matplotlib and pandas), are also a large part of it, but the posts I've seen online suggest conda causes issues.
I think it would be worth setting up a VM with a different Python distribution / virtual environment with minimal modules and attempting a compile with the existing spec on that VM.
Fixed the bug with "no adverse event" not displaying any records. It was due to the filter set not being sorted by VAERS_ID before the SERB method was used to return the records via a df.take. Confirmed that the total "no adverse events" for bivalent booster equals the number for bivalent booster plus the regex search, so the filters are confirmed to work together properly. Moving the symptom search to the end should provide a minor speed improvement.
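A minimal reproduction of the shape of that bug (toy data, not the actual SERB code): `take` is positional, so taking positions computed against a sorted ordering from an unsorted frame returns the wrong records; sorting by VAERS_ID first makes the positions line up.

```python
import pandas as pd

# Hypothetical filter set indexed by VAERS_ID, arriving unsorted.
df = pd.DataFrame({"VAERS_ID": [30, 10, 20],
                   "SYMPTOM":  ["none", "rash", "fever"]}).set_index("VAERS_ID")

# The fix: sort by VAERS_ID first, then take the positional rows,
# so positions computed against sorted IDs select the right records.
rows = df.sort_index().take([0, 1])
```

Without the `sort_index()`, `take([0, 1])` would return IDs 30 and 10 instead of 10 and 20, which is exactly the class of mismatch that made "no adverse event" come back empty.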
Closing this pre-release conversation as we are close enough to release of first stable version to no longer require it
G'day!