teddygroves / bibat

A batteries-included template for Bayesian data analysis projects
https://bibat.readthedocs.io/
MIT License

comments on the vignette #36

Closed: OriolAbril closed this issue 1 year ago

OriolAbril commented 1 year ago

The link to "data from the 2006 Major league season" is not rendered correctly (6th paragraph of the intro).

"After I answered the wizard’s questions bibat creted a new folder called baseball that looked like this:" answers should be in the vignette (save author names and email, or maybe even with those and using placeholder ones)

I don't understand the part about "Since this worked, I added a new makefile target for the raw data files:". The . at the beginning of the last line doesn't look right, but more importantly, the target isn't used anywhere.

Naming of paths is inconsistent across the vignette. Given that there are multiple files with similar names, I would stick to paths relative to the project home for everything. For example, right at the beginning we start editing src/prepared_data.py, then we move on to data_preparation_functions.py (which also lives inside src), and right after that to prepare_data.py (now in the project home).

"To check that all this works, we can run the script prepare_data.py manually or using make analysis." doesn't work with make analysis because tests still refer to example data. Moreover, when running python prepare_data.py I get a pydantic validation error.

stopped here for now

teddygroves commented 1 year ago

Thanks very much for the comments and for catching some problems with the vignette.

The path naming should clearly be made consistent - sorry if it was confusing to go through! I also think it is kind of annoying to have all three of prepare_data.py, src/prepared_data.py and src/data_preparation_functions.py in the first place and will try to consolidate these files.

I'll add the exact answers to the questions to the vignette. This is the reason for the test error you hit - I answered 'no' to the tests question but didn't record this.

The . in the makefile is to source the virtual environment, as in https://unix.stackexchange.com/questions/114300/whats-the-meaning-of-a-dot-before-a-command-in-shell. The reason for adding the rule was to allow automated fetching of the raw data. However, as you point out, this isn't finished.

If you still have it, could you possibly copy the pydantic error here?

teddygroves commented 1 year ago

I think PR #37 addresses most of these concerns - I am waiting to merge it because of a GitHub Actions error that I haven't been able to reproduce on my computer. I should have time to get to the bottom of it tomorrow.

teddygroves commented 1 year ago

The pull request has been merged and I think the vignette and the experience as a whole are improved a bit.

I checked all of the paths in the vignette and added the answers to the questions as suggested.

I also changed the target project's structure so that all python files other than tests are in the same folder and there are only two data preparation files - <package>/data_preparation.py and <package>/prepare_data.py, with the latter being a thin script. I think finding data preparation logic is now a lot more intuitive.
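
For illustration, here is a rough sketch of what I mean by that split. This is hypothetical Python with illustrative names and paths, not bibat's exact code: the reusable logic sits in functions that would live in <package>/data_preparation.py, and <package>/prepare_data.py would just wire them together.

from pathlib import Path

import pandas as pd


# logic that would live in <package>/data_preparation.py
def prepare_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply whatever filtering and renaming the analysis needs."""
    return raw.dropna().reset_index(drop=True)


def write_prepared_data(prepared: pd.DataFrame, out_dir: Path) -> None:
    """Write the prepared table where downstream scripts expect it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prepared.to_csv(out_dir / "prepared.csv", index=False)


# the thin entry point that would be <package>/prepare_data.py
if __name__ == "__main__":
    raw = pd.read_csv("data/raw/measurements.csv")  # illustrative path
    write_prepared_data(prepare_data(raw), Path("data/prepared"))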

Thanks again for the helpful comments - please let me know if you think they have been addressed!

OriolAbril commented 1 year ago

I tried running the vignette from scratch and I am now somehow having cmdstan installation issues. I am on Debian 11. Here is the error message:

Installing CmdStan version: 2.32.1
Install directory: /home/oriol/.cmdstan
Downloading CmdStan version 2.32.1
Download successful, file: /tmp/tmp0mrlhx_s
Extracting distribution
Unpacked download as cmdstan-2.32.1
Building version cmdstan-2.32.1, may take several minutes, depending on your system.
Traceback (most recent call last):
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/install_cmdstan.py", line 282, in build
    do_command(cmd, fd_out=None)
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/utils/command.py", line 76, in do_command
    raise RuntimeError(msg)
RuntimeError: Command ['make', 'build', '-j1']
    error during processing No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oriol/Documents/repos_oss/baseball/.venv/bin/install_cmdstan", line 8, in <module>
    sys.exit(__main__())
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/install_cmdstan.py", line 668, in __main__
    run_install(InstallationSettings(**args))
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/install_cmdstan.py", line 602, in run_install
    install_version(
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/install_cmdstan.py", line 422, in install_version
    build(verbose, progress=progress, cores=cores)
  File "/home/oriol/Documents/repos_oss/baseball/.venv/lib/python3.10/site-packages/cmdstanpy/install_cmdstan.py", line 286, in build
    raise CmdStanInstallError(f'Command "make build" failed\n{str(e)}')
cmdstanpy.install_cmdstan.CmdStanInstallError: Command "make build" failed
Command ['make', 'build', '-j1']
    error during processing No such file or directory
make: *** [Makefile:18: env] Error 1

I am afraid I don't understand the message and I'm not sure how to go about fixing it. Sorry about the slow response, too.

teddygroves commented 1 year ago

Thanks for trying again!

Installing cmdstan can be a pain unfortunately. From the error message I think it downloaded the cmdstan repository and saved it in the right place but the make build -j1 command failed.

If you have time and the folder is still around it would be really helpful if you could try running make build -j1 from the directory /home/oriol/.cmdstan/cmdstan-2.32.1. This should hopefully produce a more verbose error message that narrows down the problem a little.

OriolAbril commented 1 year ago

Got this:

❯ make build -j1
stan/lib/stan_math/make/libraries:117: *** "Need to set TBB_CXX_TYPE for non-standard compiler other than gcc or clang.".  Stop.

which was already mentioned before, but I do have gcc available:

❯ gcc --version
gcc (conda-forge gcc 10.4.0-19) 10.4.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(I even have cmdstanpy installed via conda-forge and working in that same env.)

teddygroves commented 1 year ago

I’m a bit stumped: my best guess is that something has gone wrong due to conda-forge gcc doing something that cmdstan doesn’t expect. Specifically, I expect that in here the variable CXX_TYPE is ‘other’ when it should be ‘gcc’, leading to TBB_CXX_TYPE being wrong here.

For a workaround I think the best thing to do is look at how conda-forge installs cmdstan. I’ll check that as soon as possible and post again here.

More generally I need to check which conda/non-conda combinations the Stan ecosystem supports and adjust bibat accordingly. Currently bibat doesn’t do anything conda related. This is partly because I haven’t got much experience with conda and partly because it seems like it would be complicated to support both conda and standard Python virtual environments, which I definitely want to keep. I’m very curious to hear what you think about the second consideration!

teddygroves commented 1 year ago

I found a couple of reports about the same issue here and here. In both cases cmdstan failed to install with the same error and there was some conda stuff around.

Apparently adding these lines to the (possibly new) file /home/oriol/.cmdstan/cmdstan-2.32.1/make/local might work:

CXX=g++
TBB_CXX_TYPE=gcc

From this file it looks like the conda-forge recipe for building cmdstan also needs to set the variable TBB_CXX_TYPE in make/local.

OriolAbril commented 1 year ago

I restarted with a clean conda env (which had no gcc compiler, so the system one was used and everything worked well). Here are some more comments.

teddygroves commented 1 year ago

Thanks a lot for persevering with the installation issues and for the new comments. I think the PR linked above should address them all.

OriolAbril commented 1 year ago

There is still an incorrectly formatted link in the vignette intro:

[data from the 2006 Major league season])https://github.com/stan-dev/example-models/blob/master/knitr/pool-binary-trials/baseball-hits-2006.csv)

As for the code execution itself, the exclusion of the pytest line now leaves an empty line in its position, which breaks the shell syntax for me. I got:

. .venv/bin/activate && (\

/bin/sh: 2: Syntax error: end of file unexpected (expecting ")")
make: *** [Makefile:35: analysis] Error 2

Removing the empty line from the make analysis recipe fixed the issue. I have no idea how to get the template to add the pytest line only when necessary while leaving no empty line in either case, though.

The last line calling write_prepared_data() is missing from the provided prepare_data() code sample, and there is a stray blank line at the beginning of the block. It also looks like MEASUREMENTS_FILE should not be deleted as indicated in:

"To finish off I deleted the unused global variables MEASUREMENTS_FILE, NEW_COLNAMES, DROPNA_COLS and DIMS"

After that I managed to run everything to completion.


Here are some slightly more general comments.

I am still very confused about the intended workflow. It is mentioned at multiple intermediate steps that you can run the scripts manually or run make analysis, but up until the very end of the notebook make analysis won't run successfully. Are users expected to run make analysis at each intermediate step until it runs into an error?

I see the goal, and I think this is necessary for reproducibility, but I'd welcome a bit more guidance on it: for example, how much of a bad user am I being if I run python baseball/sample.py instead of make analysis for sampling, leaving make analysis only for the very last step?

I am also a bit surprised by the use of json to store the MCMC results. Why not zarr, or even netcdf? zarr should generate files at least 2-3 times smaller (maybe more with compression) and supports distributed reading "out of the box". I fear using json might prevent the template from scaling to bigger analyses.

teddygroves commented 1 year ago

Thanks for catching those careless mistakes - sorry! The ones you pointed out are now fixed on the latest main branch, including the makefile whitespace: I should have checked that.
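
For reference, the whitespace problem is the classic Jinja one: when a conditional block renders to nothing, the newline around its tags gets left behind, producing the stray blank line that broke the recipe above. Here is a minimal sketch of the idea, assuming a Jinja2-templated Makefile recipe with an illustrative create_tests variable (not bibat's actual template code):

from jinja2 import Template

# Without trim_blocks, the newline after the "{% endif %}" line survives
# rendering, so when create_tests is false the recipe contains a blank line
# (the /bin/sh error shown above); trim_blocks=True, or "{%- ... %}" markers,
# swallows those newlines in both cases.
makefile_recipe = Template(
    "analysis:\n"
    "\t. .venv/bin/activate && (\\\n"
    "{% if create_tests %}\n"
    "\tpytest &&\\\n"
    "{% endif %}\n"
    "\tpython baseball/prepare_data.py &&\\\n"
    "\tpython baseball/sample.py )\n",
    trim_blocks=True,
)
print(makefile_recipe.render(create_tests=False))  # no blank line where pytest was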

The workflow is intended to be neutral between running Python scripts individually and running make analysis up to an error. In practice running an individual script is quicker, as it avoids running some parts of the analysis, but running make analysis is also nice because it can sometimes catch an unexpected problem; plus, having a single entry point means you don't have to remember the names of all the scripts, or that jupyter execute is the command for running a notebook like a script. I don't think I wrote any of that down anywhere though - thanks for highlighting the issue! It should definitely be clearer that seeing errors is part of the intended workflow, and I can see how the way the vignette uses both methods is confusing.

Regarding the format for storing InferenceDatas, this reflects my own usage - netcdf was broken on ARM Macs for a long period, and I didn't have a problem with output size, so I switched to json. I now quite like being able to read, copy bits of, and even sometimes directly edit the json files. On the other hand, as you say, json is less scalable than the other formats, so I guess they would be better in a lot of cases. I've made a new issue (#57) for looking at this again.
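
For anyone weighing the options, here is a minimal sketch of how the formats differ at the call site. It uses an arviz example dataset as a stand-in for a real fit, and illustrative file names:

import arviz as az

idata = az.load_arviz_data("centered_eight")  # stand-in for a real model fit

# json: human-readable, easy to grep or hand-edit, but comparatively large
idata.to_json("idata.json")
idata_json = az.from_json("idata.json")

# netcdf: compact binary, the usual arviz interchange format
idata.to_netcdf("idata.nc")
idata_nc = az.from_netcdf("idata.nc")

# recent arviz versions also provide InferenceData.to_zarr / from_zarr, which
# add compression and chunked reads for larger-than-memory results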

OriolAbril commented 1 year ago

Perfect, I already commented on the pyopensci issue too. I'll close this now.