Changes to Data Extract Reporting

marshalc commented 8 years ago

From @VirginiaChiocchia:

Issues that apply to all three datasets are the following: · The variables should be in the same format in all cases i.e. if numerical please do not include string/test such as “None” or “Unknown”; if datetime please do not include string/test such as “None” or “Unknown”; etc. I would also need the “prefixes” such as OPcXX, LOcXX and similar to be removed from the string. Removing these from all fields myself will take me a substantial amount of time and also this is not always consistent within a variable (e.g. in donor.retrieval_hospital and donor.person.get_ethnicity_display only appears in certain categories but not in others) so it will take me even longer to clean the variables. · I do not need rows where information is not present i.e. No Trial ID, as I will not be able to link this with any data anyway. If it’s easy and straightforward for you to exclude these, that would be helpful. · In allocation_report and procurement_report the number of headings/variables names does not match the number of fields for some rows e.g. in procurement_report the last heading (donor.last_creatinine) is in column U but the rows of all obsertvations end either in column V, W or X. In the latter dataset I think there is also some sort of mix-up in some variables e.g. column R (donor.get_diabetes_melitus_display) has completely different type of entries such as dates, categories (UW, HTK, etc.) and other coding (DOc01 Cerebrovascular Accident, Doc03 Trauma, etc.). Please double check the data have been extracted correctly in the datasets before sending them.

Specifically in each dataset I have noticed the following:

Organ Report – for any text variables (e.g. organ.perfusion_not_possible_because, organ.not_transplantable_reason, organ.recipient.organ_untransplantable_reason) would it be possible to not include the quotation marks “”? It would be helpful having only text or empty/blank

Allocation Report – what do you mean with “This is another report because you can’t flatten OrganAllocations into one per row. There can be an unlimited number of Allocations”? Also, you said that if allocation.reallocated == N, this is the final allocation recorded for a given Organ Trial ID. What if allocation.reallocated == Y? Are these Organ still part of the trial? Or it means they have been reallocated in some non-participating centre?

... now needs unpacking into separate tasks/issues.

marshalc commented 8 years ago

This has been reprioritised as Top Priority as of right now following conversation with @AllyBradley

VirginiaChiocchia commented 8 years ago

Correction about the third bullet point regarding number of headings/variables names not matching the number of fields for some rows and also the mix-up in some variables. This seem to be working fine and be aligned correctly when imported for use. Therefore you can ignore this point.

All other issues still apply.

marshalc commented 8 years ago

Unpicking into the following:

Variables in the same format - this seems to be a reference to an overly simplistic view of the data columns as strongly typed, which isn't accurate. However, it's likely a limitation of the tools being used by Virginia, so this can be worked around by:
- Identifying a primary base "type" for each column and making that clear in the report documentation, including highlighting the format that each type will take in the CSV data output
- Cleaning the output data to hide values that don't meet the type. In most cases that means using a blank value in the CSV output to hide None, Unknown, 0, Null and similar valid answers.
- Raise a question as to what analysis is being done and how ignoring some data effects that
Virginia is struggling with the translated data types (e.g. anything that uses a limited Choice set, typically highlighted by a column using "get_xxxx_display".) Proposed solution here is to provide:
- Two columns of data
- First will be the internal system representation of the data which will be "clean"
- Second will need to be a localised (en-GB) output of the display strings. These can either be discarded, or used as quick visual reference.
Rather than providing "all data uploaded on the database up to that date" as originally requested (see email 28th April @ 15:43 with the data extract specification attached), filtering of the data to only provide:
- Donor information only where a randomisation has occurred, and thus a Trial ID generated, for the Procurement Report
- Organ information only where a final recipient has been allocated
- OrganAllocations for all Organs associated with Donors in the procurement report
Organ Allocation report matches the column numbers throughout exactly. However, applications like Excel are unable to distinguish strings containing commas from commas separating values, even when the string is contained using double quotes (as the accompanying email states some columns will be). Ignoring the cheeky comment suggesting I didn't check my output, this looks like comma separated values will not be practical as a delimiter, and thus I will switch to using something else, probably a semicolon. I will also add additional cleaning and escaping to the various text entry columns so that values that are likely to confuse Excel are less likely to trip them up. More professional tooling is unlikely to have a problem with the existing data as the output is consistent as described.
The output in the Organ Report for rows with no allocated Recipient has two surplus commas, thus shows as two extra columns. This is an error. However, this error will disappear when these rows are excluded from the data as per item 3 above. Given they were consistently blank columns, these shouldn't have caused any issues.
Issue about data types being mixed between columns looks to be a result of poor data importing as explained above with the issue of using a comma as the separator. I can't find any examples in the data where this occurs when correctly read.
Request for having blank strings shown as null/empty values, rather than blank strings ("") noted. May cause some issues for the data separating, but I'll look at that as per item 4.
Allocation Report questions: Is an example of RTFM.
- Flattening: i.e. the process of translating relational data into non-relation flat data. This can be done via a number of techniques, several of which involve some form of data loss.
- From the documentation: wp4.compare.models.transplantation.OrganAllocation "Organs can be allocated multiple times before finding a definitive recipient. This class acts as the record of these allocations and a link between Organ and Recipient." This means that each Organ will have between 0 and N Organ Allocations depending on what happens to the organ in the transplant allocation and reallocation process.
- This is documented in the Transplant form design process (see diagram!) which shows that some organs are not allocated at all; some are allocated once to a final recipient; others can be reallocated many times over - however if any of the allocations are to a non Trial site, then we stop tracking them and they are dropped from the study.
- These rows can only be identified through looking at a combination of fields, not all of which were requested.

marshalc commented 8 years ago

Noting update from @VirginiaChiocchia sent in during creation of the last bit of analysis

VirginiaChiocchia commented 8 years ago

As noted above from both of us, points 4, 5 and 6 can now be ignored. CSV can still be used.

Re point 7 - let me know if removing ("") will create issues for the data separating and will therefore need to remain as it currently is. I will find another way to remove it.

Re point 8 - in randomised trials Allocation is a term commonly used to indicated the randomisation result (i.e. randomisation group/arm) hence the initial confusion. Apologies for it, as being completely new to the system, repository and certain terms it takes a while to get my head round this system, even after reading the manual. I cannot find the Transplant form design process, and cannot access it from the link above.

ouhft / COPE

Changes to Data Extract Reporting #105