redcap-tools / PyCap

REDCap in Python
http://redcap-tools.github.io/PyCap/
MIT License

Unexpected Behavior: DataFrame export not working on pycap 2.X (used to work on pycap 1.1.3) #246

Closed camilovelez closed 1 year ago

camilovelez commented 1 year ago

Hi, we are having an issue with the export_records function when using the parameter format_type="df" in either pycap 2.1 or 2.2. It cannot export the data as a DataFrame because there are missing fields. We explored the problem, and we believe it is caused by missing fields in the JSON file that is being exported with newer PyCap versions, compared with the JSON file exported with the version we previously used (version 1.1.3). This has been an issue for the backward compatibility of our development based on PyCap.

pwildenhain commented 1 year ago

Got it, thank you for reporting this 👍🏻

Can you confirm that if you install version 1.1.3 that the process does in fact work?

$ pip install PyCap==1.1.3

And then re-run your code (though you will need to replace format_type with format for the old version) and see if it works.

The reason to do this is to make sure it's actually a PyCap version issue and not a REDCap version issue.

If it works with the old package version then the next step will be to try and create a reproducible example which demonstrates the unexpected behavior.

Lastly, you said this is only a problem for the df export right? So the json export works with the current package version?

camilovelez commented 1 year ago

Hi @pwildenhain, thank you so much. I can confirm that when using PyCap==1.1.3, this works correctly and the issue does not appear.

camilovelez commented 1 year ago

Also, just to clarify the issue a bit more, export_records does work correctly in pycap 2.X when setting format_type='json', the issue only arises with format_type='df'. Thanks yet again

pwildenhain commented 1 year ago

Ok thanks for double checking

  1. Can you send the traceback (error message you get)?
  2. Is it possible to share the JSON output with me? I understand there may be sensitive data you need to remove first, or even just a subset of the data would work as well. My goal is to recreate the error message in step 1, that way I can make sure the issue is fixed with whatever code I change

maferoaloaiza commented 1 year ago

Hi @pwildenhain, I am working with @camilovelez on this project. We are working with multiple forms from a REDCap database for a longitudinal study. We are testing the export_records function on one particular form from a database whose primary key is the field study_id. To compare the output, we run the tests on a VM with PyCap 1.1.3 and in a Docker container with PyCap 2.2.0.

When we export that form as JSON using PyCap 1.1.3, we get a JSON object containing the key study_id, which corresponds to the "record" of the event, and the key redcap_event_name, which corresponds to the REDCap event names. When we export the same form as JSON using version 2.2.0, we get a JSON object that does not contain those two keys. Those keys are not inherently part of the form, but we believe PyCap added them for longitudinal studies to support some transformations, such as building a DataFrame.

The error message we get when exporting the data as a DataFrame using PyCap 2.2.0 is shown below. The issue is that when the data is exported as CSV and then converted to a DataFrame, the study_id index is not found; as mentioned above, study_id is the database's primary key. Unfortunately, we cannot share the output JSON file with you, since all the information is sensitive and confidential. However, please let us know if we can provide any other detail that would make the error easier to trace and fix. Thanks!

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/redcap/methods/records.py", line 264, in export_records
    return self._return_data(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redcap/methods/base.py", line 411, in _return_data
    dataframe = self._read_csv(buf, **df_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/redcap/methods/base.py", line 139, in _read_csv
    dataframe = pd.read_csv(buf, **df_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 321, in read
    index, column_names = self._make_index(date_data, alldata, names)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py", line 379, in _make_index
    simple_index = self._get_simple_index(alldata, columns)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py", line 411, in _get_simple_index
    i = ix(idx)
        ^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/base_parser.py", line 406, in ix
    raise ValueError(f"Index {col} invalid")
ValueError: Index study_id invalid

pwildenhain commented 1 year ago

Great, this is all really helpful information. Can you also:

  1. Show what command you're running? It sounds like you're limiting which forms are pulled in
  2. Post the following for your project:

from redcap import Project

my_project = Project(url, token)
my_project.def_field
my_project.is_longitudinal

I would expect to see study_id and True

My current suspicion is that you're facing something similar to what another user reported in #193. In a nutshell, there was a fairly recent change in the REDCap API: when you limit a records export to certain forms, it doesn't include the "primary key" fields such as study_id and redcap_event_name (since, as you pointed out, they don't live on that form).

One sure-fire way to confirm this is the case is to run the API request in the API playground (while still limiting the forms), and see if study_id and/or redcap_event_name are included or not. If not, then PyCap wouldn't know to include them. Version 1.1.3 isn't doing anything special in this regard either.
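The same check can be done from Python once the JSON export is in hand. This is only a rough sketch: missing_key_fields is a hypothetical helper, not part of PyCap, and the sample rows are made up to mimic an export limited to one form.

```python
def missing_key_fields(records, expected=("study_id", "redcap_event_name")):
    """Return the expected keys that are absent from at least one exported row."""
    return [key for key in expected if any(key not in row for row in records)]


# Made-up rows standing in for the output of
# export_records(format_type="json", forms=["my_form"]).
rows_without_keys = [{"weight": "70", "height": "175"}]
rows_with_keys = [
    {"study_id": "1", "redcap_event_name": "visit_1_arm_1", "weight": "70"}
]

print(missing_key_fields(rows_without_keys))
print(missing_key_fields(rows_with_keys))
```

If the first call lists study_id or redcap_event_name, the default DataFrame index cannot be built from that export.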

If this turns out to be an API issue, then I think your best recourse is one of the following:

Override the df_kwargs if you don't need the study_id or redcap_event_name (since the default behavior expects these fields)

my_records = my_project.export_records(format_type="df", forms=["my_form"], df_kwargs={"index_col": None})

If you do need these columns, then I suggest using the export_metadata method + the fields parameter of export_records method to automatically generate a list of fields that you want returned

(Code not tested, but you get the idea)

# export_metadata returns a list of dicts by default (format_type="json")
form_fields = [field["field_name"] for field in my_project.export_metadata(forms=["my_form"])]
export_fields = ["study_id", "redcap_event_name"] + form_fields
my_records = my_project.export_records(format_type="df", fields=export_fields)

Something that I could do to improve the user experience is warn if certain fields aren't found that we would expect to be there, and then only add something like redcap_event_name to the index if the column exists in the export
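Roughly, that idea could look like the sketch below. It is not PyCap code: usable_index_columns is a hypothetical name, the CSV text is made up, and it only inspects the header row using the standard library.

```python
import csv
import io
import warnings


def usable_index_columns(csv_text, expected):
    """Return the expected index columns found in the CSV header; warn about the rest."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = [col for col in expected if col in header]
    missing = [col for col in expected if col not in header]
    if missing:
        warnings.warn(f"Expected column(s) not in export: {', '.join(missing)}")
    return present


# Made-up CSV standing in for a form-limited export with no key fields.
sample_csv = "weight,height\n70,175\n"
index_cols = usable_index_columns(sample_csv, ["study_id", "redcap_event_name"])
print(index_cols)
```

The resulting list (or None when it is empty) could then be handed to pandas as index_col, so a missing redcap_event_name would produce a warning instead of a ValueError.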

maferoaloaiza commented 1 year ago

Hi, @pwildenhain. Thanks for your answer.

  1. We are limiting the forms that are pulled in. For the tests, we are using just one form
  2. When we run the commands you say, we do get study_id and True.

We have run the API request in the playground, and, as you say, it does not include the fields study_id and redcap_event_name.

Maybe it would be good to include the information you are giving us here in the documentation of the export_records function, especially since the DataFrame export expects those fields by default.

We will try the solutions you suggest. Thank you!

maferoaloaiza commented 1 year ago

Something that is still odd is that we are not getting that same error when using PyCap 1.1.3 with the same URL and token. So maybe there was some difference in how the DataFrames were generated with that version. And when we export the JSON files with both versions, we get the two fields with PyCap 1.1.3 and not with 2.2.0.

pwildenhain commented 1 year ago

Ah ha! I found it! You were right, Version 1.1.3 did explicitly backfill the "primary key" fields. Part of my confusion comes from becoming the package maintainer long after this original code was written.

https://github.com/redcap-tools/PyCap/blob/5a77a40564deb2b58e796e4fdac0c5c874dd2f0c/redcap/project.py#L500-L528

I remember deleting this code when I upgraded to 2.0.0 because I couldn't understand why we would need it 😅 and I guess now I know why.
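For anyone reading along, the general idea of backfilling can be sketched as below. This is not the code at the link above, just an illustration under my own assumptions: backfill_key_field is a made-up name, and it assumes that an export with no field or form restriction already includes the record ID field.

```python
def backfill_key_field(fields, forms, def_field):
    """Return the field list to request, making sure the record ID field is included."""
    requested = list(fields or [])
    # Only backfill when the export is narrowed by fields or forms; an
    # unrestricted export is assumed to include the record ID field already.
    if (requested or forms) and def_field not in requested:
        requested.insert(0, def_field)
    return requested


print(backfill_key_field(fields=None, forms=["my_form"], def_field="study_id"))
print(backfill_key_field(fields=["weight"], forms=None, def_field="study_id"))
print(backfill_key_field(fields=None, forms=None, def_field="study_id"))
```

The returned list would be passed as the fields argument alongside the original forms, so the record ID field is requested explicitly even though it does not live on the selected form.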

Ok, I'll add this functionality and cut a new release. Thanks for reporting and for your thoroughness. This is one of the rare instances where I agree that we should "fix" the default API behavior.

maferoaloaiza commented 1 year ago

Thanks, @pwildenhain! Glad we could help and it won't be an issue anymore :)

camilovelez commented 1 year ago

Thank you so much, @pwildenhain!

raphaelchristin commented 1 year ago

I am still encountering a similar issue with the 2.4.0 PyCap version and the redcap_event_name field. Has this been fixed or should I use one of the workarounds?

pwildenhain commented 1 year ago

Can you open a new issue with the details? Happy to take a look