neuropsychology / NeuroKit

NeuroKit2: The Python Toolbox for Neurophysiological Signal Processing
https://neuropsychology.github.io/NeuroKit
MIT License

[Feature] Codebook Generation from Documentation #1017

Closed DerAndereJohannes closed 1 month ago

DerAndereJohannes commented 2 months ago

Description

This PR aims to create a general Codebook containing all variables that can be generated by the NeuroKit package. The idea for this feature stems from discussion #1012 from @HeatherUrry.

This feature would allow users to more easily see which variables are extractable from NeuroKit, and would also be useful in other software programs.

Note

This pull request should be considered more of a draft, as some polishing is still required.

Feedback and change requests are all welcome. Let me know if I should also share a copy of the documentation _build.

Proposed Changes

I added an additional Sphinx directive that reads the information in the Returns section of the Python source code documentation and creates a CSV file. At the same time, the directive places a formatted version of that information back into the Returns section, so that it does not have to be written twice. For example, the line ECG_Raw|The raw ECG Data. appends ECG_Raw,The raw ECG Data.,Electrocardiogram,ecg_process.py to the CSV file and adds the formatted string * ``ECG_Raw``: The raw ECG Data. to the documentation page. Note that if a description contains a comma, the entire field is wrapped in quotation marks, i.e., ECG_foobar,"foo, bar",..., to keep the CSV formatting intact.
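
To make the mechanism concrete, here is a minimal sketch of how such a directive can be written. The class name, output path, and exact line format are illustrative assumptions, not the PR's actual code; only the docutils/Sphinx APIs used are real:

```python
import csv
import os

from docutils import nodes
from sphinx.util.docutils import SphinxDirective


class CodebookDirective(SphinxDirective):
    """Parse body lines of the form ``Key|Description``: append each line
    to a CSV file and render it as a bullet in the built documentation."""

    has_content = True

    def run(self):
        csv_path = os.path.join(self.env.app.outdir, "neurokit_codebook.csv")
        bullet_list = nodes.bullet_list()
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)  # quotes fields that contain commas
            for line in self.content:
                key, _, description = line.partition("|")
                writer.writerow([key.strip(), description.strip()])
                # Render ``key``: description as a bullet on the docs page
                paragraph = nodes.paragraph()
                paragraph += nodes.literal(text=key.strip())
                paragraph += nodes.Text(": " + description.strip())
                item = nodes.list_item()
                item += paragraph
                bullet_list += item
        return [bullet_list]


def setup(app):
    app.add_directive("codebook", CodebookDirective)
    return {"parallel_read_safe": True}
```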

The CSV is rendered on the codebook webpage as shown in this screenshot: codebook-page-visual

Here is a sample codebook: neurokit_codebook.csv

I am not sure whether it is possible to add the table to the Codebook page from within the documentation generation process, so I added some JavaScript that dynamically loads the CSV file into the codebook page. Please let me know if this is too much.

Checklist

DominiqueMakowski commented 1 month ago

Thanks for this huge endeavour, @DerAndereJohannes, it is quite impressive. Given the automation that you implemented, it should be relatively painless to maintain and update, is that correct?

HeatherUrry commented 1 month ago

WOW. This is awesome, @DerAndereJohannes. Thank you! What can I do to help?

DerAndereJohannes commented 1 month ago

In reality, my pull request is actually quite small (from a code perspective).

Given the automation that you implemented, it should be relatively painless to maintain and update, is that correct?

My contribution really boils down to just two main files.

Given that these files are quite simple, and that the main software doing the work is actually the Sphinx documentation build system, I would argue that this new feature is easy to maintain.

All of my other line changes come simply from using these directives in their correct places, replacing the existing Returns blocks with this new system, e.g.: image

The new directive converts the Returns lines, saves them to a CSV on build, and writes them back to the documentation page as if nothing had changed. All styles are kept.

What can I do to help?

One way would perhaps be to create a better description for the codebook on the codebook page itself. As you can see in the image in my initial post, I wrote an extremely short text description for each of the sections (Codebook and Codebook Table). Although it should probably remain concise, I may have been a bit too concise? :p

Honestly, other than that, what I did was simply reuse the existing descriptions that were already in the documentation. Some parts of the documentation are not that consistent with each other and would probably require quite a bit of refactoring, e.g., I had to add the HRV non-linear features into the Returns statement here, since they were all present in the description of the function but the Returns block simply stated "Contains non-linear HRV metrics." My change here does mean that the data is technically duplicated (in the description and in the Returns block), but I did not feel confident enough to rewrite these docstrings to an acceptable standard for this pull request.

That said, I am unsure whether broader documentation improvements for the codebook feature are relevant for this PR, since we would end up rewriting the majority of the docs here if that were the case. I am sure the maintainers could come up with a good solution.

Thanks!

danibene commented 1 month ago

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: https://github.com/neuropsychology/NeuroKit/pull/785

I think it would be helpful if we could document each method only in the docstring of the associated function, and then extract that description in a structured format so that it can be included in the generated report without having to manually add the description in a separate file.
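
To illustrate the idea, a rough sketch of harvesting a docstring at runtime (the helper name is hypothetical; inspect.getdoc and nk.ecg_clean are real):

```python
# Rough sketch: reuse a function's docstring as the report description,
# instead of duplicating the text in a separate file.
import inspect

import neurokit2 as nk


def method_description(func):
    """Return the cleaned docstring of a processing function so that a
    generated report can embed it directly."""
    return inspect.getdoc(func) or ""


print(method_description(nk.ecg_clean).splitlines()[0])  # first line only
```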

DerAndereJohannes commented 1 month ago

Thanks for all of your comments, @danibene!

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs could for sure work too and could maybe make it easier to read. However, the Python csv library already automatically accounts for additional commas by putting quotation marks around the field, so it should be fine as well. It is true that probably no one will ever use a tab in a docstring, though. Should I switch it?
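
For illustration, the quoting behavior of the standard csv module (a self-contained snippet, not code from the PR):

```python
# The standard csv module quotes any field containing the delimiter,
# so embedded commas do not break the row structure.
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["ECG_foobar", "foo, bar", "Electrocardiogram"])
print(buf.getvalue().strip())  # ECG_foobar,"foo, bar",Electrocardiogram
```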

I’ve done something similar here in case helpful: https://github.com/miso-sound/miso-sound-annotate/blob/cc36d781327d7d6dec425a8178714bab40a64fa3/docs/readme/gen_summary_table.py

I see you created a script and then had it run in your CI/CD pipeline! This is of course also something we could do to avoid the additional JavaScript. With this pull request, I was aiming to use just Sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: https://github.com/neuropsychology/NeuroKit/pull/785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table: we could apply a no-op decorator like @gen_report to the report function, and then have the CI/CD treat the decorator as a macro, replacing it with a function defined from the docstring information. The only arguments against this, for me, are function testing (although there is probably a way around this) and the fact that code generation during CI/CD can be dangerous if not reviewed very carefully (e.g., the xz attack). I would not mind creating this feature too if it is wanted.
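
As a minimal sketch of the marker-decorator idea (the name @gen_report comes from the discussion above; everything else is a hypothetical illustration):

```python
# A no-op marker decorator: it changes nothing at runtime, but gives a
# documentation build step a hook to discover report functions and
# harvest their docstrings.
def gen_report(func):
    func._is_report = True  # tag the function so a build script can find it
    return func


@gen_report
def ecg_report(info):
    """Describes the cleaning and peak-detection steps applied to the ECG."""
    ...
```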

DominiqueMakowski commented 1 month ago

I'm happy to merge it when @danibene gives the green light. I have no opinion regarding the javascript so it's up to what you guys feel is the most maintainable approach 🤷

danibene commented 1 month ago

Thank you for your responses @DerAndereJohannes !

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

Do you think a different separator (e.g. tabs instead of commas) would be more robust to formatting mistakes in the docstrings?

Tabs could for sure work too and could maybe make it easier to read. However, the Python csv library already automatically accounts for additional commas by putting quotation marks around the field, so it should be fine as well. It is true that probably no one will ever use a tab in a docstring, though. Should I switch it?

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

I see you created a script and then had it run in your CI/CD pipeline! This is of course also something we could do to avoid the additional JavaScript. With this pull request, I was aiming to use just Sphinx for everything. But if I am allowed to do something similar to what you did, then I can convert it to that for the doc builder.

I think both would be possible. I also don't have a strong opinion about which would be more maintainable; I do see the advantages of using Sphinx for everything. Feel free to keep the current version if you'd like.

Maybe I'm dreaming here, but I was also wondering if we could do something similar to make the "report" functionality easier to develop and maintain: https://github.com/neuropsychology/NeuroKit/pull/785

This is definitely a possibility. I would imagine this looking like what you did in your gen_summary_table: we could apply a no-op decorator like @gen_report to the report function, and then have the CI/CD treat the decorator as a macro, replacing it with a function defined from the docstring information. The only arguments against this, for me, are function testing (although there is probably a way around this) and the fact that code generation during CI/CD can be dangerous if not reviewed very carefully (e.g., the xz attack). I would not mind creating this feature too if it is wanted.

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled? (https://github.com/neuropsychology/NeuroKit/blob/master/neurokit2/data/data.py#L176-L180)
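
A hypothetical sketch of that file-based approach (the file name and helper are illustrative; nothing here is NeuroKit's actual API):

```python
# Keep the report descriptions in a bundled JSON file and load them on
# demand, instead of generating code at build time.
import json
from pathlib import Path


def load_method_description(method, path="report_descriptions.json"):
    """Look up the description text for a processing method by name."""
    descriptions = json.loads(Path(path).read_text(encoding="utf-8"))
    return descriptions.get(method, "")
```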

DerAndereJohannes commented 1 month ago

Thank you for the comments.

For example, I imagine one next step could be to make a list of the functions that need to be checked and compare these variables to the code book content (some of this checking could probably be automated).

Do you mean a way to double check that all the variables emitted from the functions are in fact the variables that are stated in the codebook?

Yes, exactly. It doesn’t have to be addressed in this PR if it’s a significant effort, but it might be worth considering when thinking about how to parse “meta-keys.”

I think this should be pretty easy to do, even as a pytest, which would require every new PR that adds a key to conform before it is merged. I would be happy to implement this in a different PR.
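
A hypothetical sketch of what such a test could look like (the CSV path is an assumption; nk.ecg_simulate and nk.ecg_process are real NeuroKit functions):

```python
import csv

import neurokit2 as nk


def test_ecg_columns_are_in_codebook():
    # Collect the variable names listed in the generated codebook
    with open("docs/codebook/neurokit_codebook.csv", newline="", encoding="utf-8") as f:
        codebook_keys = {row[0] for row in csv.reader(f) if row}

    # Process a short simulated ECG and compare the emitted columns
    signals, _ = nk.ecg_process(nk.ecg_simulate(duration=10), sampling_rate=1000)
    missing = set(signals.columns) - codebook_keys
    assert not missing, f"Columns missing from the codebook: {sorted(missing)}"
```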

If you think the current version with the CSVs will work just as well, I’m fine with keeping it as is.

You might be right that using a tab would be a lot simpler, as CSV is not really a well-defined format and a tab should be far rarer than a comma in the descriptions. I think I will change this now.
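
Switching the writer to tab-separated output is a one-line change in the csv module (the file name is illustrative):

```python
import csv

# delimiter="\t" turns the same writer into a TSV writer
with open("neurokit_codebook.tsv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerow(
        ["ECG_Raw", "The raw ECG Data.", "Electrocardiogram", "ecg_process.py"]
    )
```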

Do you think we could achieve this without code generation, maybe by loading the text from a file, similar to how example datasets are currently handled?

I think I need to understand the report function a bit more before I can give an answer, especially regarding how loading the text from a file would differ from embedding the text directly in the report function.

With all these good ideas and the drive to make the documentation more maintainable, I think it would also make sense to create a page that describes the automated processes and all the directives I have been writing (plus the additional features to come).

DominiqueMakowski commented 1 month ago

I'll merge to keep the momentum going. Again, thanks for that impressive work. No doubt it will be really useful to users :)