unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal
929 stars 202 forks source link

Python package #267

Closed acxz closed 2 years ago

acxz commented 4 years ago

@JoshData I'm back haha, (This might be my last one in a while)

This patchset allows users to install congress as a python package. This offers several benefits:

  1. dependencies: can be directly specified in the setup.py instead of having a requirements file. This ensures that users who want to use congress will have the requirements instead running an extra command.

  2. exposing entry points: the main run and test/run files can be run anywhere as long as your python environment is the same one where you installed congress in, providing more convenience to users say in writing their own shell scripts without hardcoding the congress directory. Of course you can still use ./run and python run as before, so documentation does not need to be changed.

  3. Makes it easier for package managers to work with. While this does not directly reslve #188 if someone wants to, one can easily upload congress to PyPi now, saving users the step to even clone the repo. Distributions that currently package this codebase, such as Arch Linux, currently manually install the scripts to the usr/bin folders. (https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=congress-git) Now they can use built in python build instructions such as python setup.py install.

If this is merged, this can fixes #192

acxz commented 4 years ago

@JoshData friendly ping

acxz commented 4 years ago

@JoshData friendly ping

JoshData commented 3 years ago

Hey,

Sorry I'm actually confused by the changes. If one follows the README steps, I think you will get a copy of the library installed in the usual Python packages location, but then it says to run "./run" which will run the version that has been cloned. Or will it? Will it get the modules in the working directory or will it go to the installed package?

acxz commented 3 years ago

If we following the README as is, ./run will execute the version that has been cloned. Typing just run will execute the version that has been installed.

JoshData commented 3 years ago

I don't think this is a good idea...

acxz commented 3 years ago

Do you mean to say that the documentation should be changed to run or that run should not be a command that should be called from anywhere?

Just to clarify ./run will only work if you are in the cloned directory.

JoshData commented 3 years ago

The documentation needs to present a clear way to use the package, so it has to either use an installed package or not install it. (It shouldn't install a package but then not use it.) It shouldn't create two local copies of the package either (the git clone and then the one installed by pip). And "run" is a bad name for a globally-installed script.

acxz commented 3 years ago

The documentation needs to present a clear way to use the package, so it has to either use an installed package or not install it. (It shouldn't install a package but then not use it.)

Thats sensible, I'll change the local ./runs to runs.

It shouldn't create two local copies of the package either (the git clone and then the one installed by pip).

This is a harder one to tackle. As the only way to install the package is to clone it first, unless you are using a package manager. There is no way around this unless you symlink the two with pip install -e . which is commonly done when developing on python code. I think this should stay the way it is.

And "run" is a bad name for a globally-installed script.

Technically it will only work in the virtual env the package is installed, but yeah. Totally agree. Do you have a proposition for a better name?

acxz commented 3 years ago

@JoshData any further thoughts?

  1. Or should I go ahead and not include the scripts to be installable
  2. Or should we go ahead and just scrap the idea of having congress as a python package?
JoshData commented 3 years ago

Or should we go ahead and just scrap the idea of having congress as a python package?

Is this PR solving a real problem that you are having, or is this just for fun?

acxz commented 3 years ago

I think I have outlined the real world problems this resolves in the OP.

For most users however, this brings us one step closer to resolve #188, i.e. able to install unitedstates/congress with a pip install united-states as well as being able to pull dependencies properly instead of using requirements.txt.

I would very much like for this PR to be merged in as I believe it is useful. I just suggested the second point as ultimately I don't want to keep pushing this unless you believe there is merit in this PR, since I should respect your wishes as being the current maintainer of this project.

aih commented 3 years ago

@JoshData, @acxz I also think that having this as a Python package is helpful. As a general principle, I've moved to having all of my Python scripts as packages so that I can call their modules without remembering the paths or setting environment variables.

In this case I think @JoshData has a valid concern about namespacing. Using run or runs is just not clear outside of this context. The command should be something like usc-run. More broadly, for me, the greatest value of a package is not the command-line, but being able to from uscongress.bills import get_bills_to_process. This would require a more significant change to create a unique namespace. I think this can be done by changing the 'tasks' directory to 'uscongress'.

This may be disruptive to current users, but I think it provides more value than being able to call the run command from anywhere; if that were my main problem I'd add an alias to .bash_profile.

If @JoshData is open to the more significant change above, I would suggest a few updates (in both the code and documentation):

  1. Create a namespace (e.g. uscongress) to call the scripts as modules (e.g. from uscongress.bills import get_bills_to_process).
  2. Clearly document how the modules of this package can be called from other python scripts (i.e. what is the namespace in #1). Identify any breaking changes this will introduce for current users.
  3. Handle the namespace issue for the command-line. Any global installation should use something clearly distinctive, like usc-run. I think it shoud be possible to still use './run' from within the directory so that any users' cronjobs or other current uses are not affected. This is, to me, less confusing than ./run vs run.
acxz commented 3 years ago

The command should be something like usc-run.

usc-run is a good suggestion, I'll go ahead and update the PR to reflect that.

More broadly, for me, the greatest value of a package is not the command-line, but being able to from uscongress.bills import get_bills_to_process.

That was actually on my to do once the setup.py is integrated in this PR. Wanted to take it a step at a time ;)

Points 1 and 2 should be addressed in a follow up PR. I'll go ahead and address Point 3 in this PR.

Thanks for your input @aih!

acxz commented 3 years ago

hmm turns out I can't alias the run script with setup.py's scripts field. I would have to use console_scripts of entry_points, but that requires a python module rather than a script, which means I would have to change the structure such that run is a module. Hence requiring points 1 and 2. @aih if you think you can accomplish installing the run script under a usc-run with minimal changes, send a PR over at my fork and this branch.

Otherwise, personally I think this PR should be merged in as it is, and we can address @aih's concerns in a later (possibly more disruptive, in terms of file structure, PR). Just getting a setup.py in this repo's pipeline is a good atomic PR imo.

@JoshData if you could let us know if you are still interested in this feature that would be appreciated.

Andrew-Chen-Wang commented 3 years ago

Definitely useful to have this set as a package:

  1. Easy to upgrade in case the scraped websites change formats, especially if you're using this with several other applications.
  2. Useful to simply pip install rather than use git clone. Something like pur and dependabot helps a lot knowing when you can upgrade.
  3. What if you want customized code or customized parsers? Well, that's why it's great to have a tag release with a CHANGELOG and breaking changes and mini functions in the code for easy extensibility.
  4. Knowing which Python versions are supported per release would be great! In general, git tags would be helpful.

I agree with @JoshData for the PR itself; I don't think it's a good idea to include the run script, or a "command" script at all. The modules separated as is is good enough for people to simply import (e.g. the vote.py file is meant for vote scraping, and that's it, just like firebase admin); it'll just need extra documentation. +1 on making this a package though!

acxz commented 2 years ago

Handle the namespace issue for the command-line. Any global installation should use something clearly distinctive, like usc-run. I think it shoud be possible to still use './run' from within the directory so that any users' cronjobs or other current uses are not affected. This is, to me, less confusing than ./run vs run.

With the latest commit this is taken care of.

acxz commented 2 years ago

To really make the python methods and modules importable, I tried to create the setup.py file so that the directory structure could be kept the same. However, after spending quite a bit of time I wasn't able to get it to work. I decided to follow python packaging convention (this involves moving files into a congress/ subfolder) and imports now work properly. Due to some heavy changes I also squashed the commits for easy readability.

With this @aih's concerns are addressed and with a pip install ., congress can be installed where you can call the scripts and load/run methods located within the tasks/ files.

Associated doc changes have been added as well.

I would urge anyone interested in this endeavor to test out the package for their particular use case. Please feel to critique and state any concerns!

JoshData commented 2 years ago

This seems OK to me. Would be great if someone else could try it out next.

acxz commented 2 years ago

Would be great if someone else could try it out next.

pinging @stevesdawg @Andrew-Chen-Wang @aih

If you guys can test it out for your use case and report back if it works or doesn't please do so.

acxz commented 2 years ago

Thx for the review!

I would prefer if the package was versioned by date.

That actually makes a lot of sense. I'll add the change depending on @JoshData's opinion.

aih commented 2 years ago

This is great work, and I will test it over the next couple of weeks. My main use case is to: 1. Download bulk data (from ProPublica, here), then pip install, and then update the data with usc-run bills. @acxz is there any reason this would not work?

acxz commented 2 years ago

is there any reason this would not work?

None at all

Andrew-Chen-Wang commented 2 years ago

@aih things thatre worth testing seems like the CLI command, general importing of the package, and a single use case

sseshan7 commented 2 years ago

The test plan I ran for these changes, from inside the repo root:

  1. pip install .
  2. usc-run votes --congress=117 --session=2022 --force=True
  3. usc-run bills to process bills I've already downloaded.

These worked normally as expected! 🎉

One side-effect however, when I call usc-run from somewhere other than the repo root, it creates the data/ and cache/ dirs in the directory where I called it. I feel like this is undesirable, and it should always download to some default location unless a download destination is provided.

What are people's thoughts on this?

michaelblyons commented 2 years ago

One side-effect however, when I call usc-run from somewhere other than the repo root, it creates the data/ and cache/ dirs in the directory where I called it. I feel like this is undesirable, and it should always download to some default location unless a download destination is provided.

What are people's thoughts on this?

I agree with you, but I think that can be a subsequent PR and need not block merge. Right now, you can only really run from within the repo.

Andrew-Chen-Wang commented 2 years ago

If you made a system install and implemented something where the folders are created at "top level", you wouldn't see the folders at all anymore since they'd be next to or even inside the package content (ie wherever the environment is). I think this behavior is desirable.

sseshan7 commented 2 years ago

Command line importing looks good 👍🏾

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 23:10:56) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import congress
>>> congress
<module 'congress' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/__init__.py'>
>>> from congress import tasks
>>> tasks
<module 'congress.tasks' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/__init__.py'>
>>> congress.tasks
<module 'congress.tasks' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/__init__.py'>
>>> from congress.tasks import bills
>>> bills
<module 'congress.tasks.bills' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/bills.py'>
>>> from congress.tasks import votes
>>> votes
<module 'congress.tasks.votes' from '/home/shrivathsav/anaconda3/envs/congress_py36/lib/python3.6/site-packages/congress/tasks/votes.py'>
>>> votes.vote_ids_for_house
<function vote_ids_for_house at 0x7fa0bcf91c80>

This looks good to me.

acxz commented 2 years ago

@JoshData it has been some time, since this PR has been tested by folks here. LGTM?

JoshData commented 2 years ago

Thanks all!