jsfenfen / 990-xml-database

Django app to consume and store 990 data and metadata
BSD 2-Clause "Simplified" License

How to load generated_schemas - appear as blank.txt #28

Open rabsef-bicrym opened 4 years ago

rabsef-bicrym commented 4 years ago

Load the metadata from the source csv files in generated_schemas with the management command: python manage.py load_metadata

I seem to only have blank.txt in generated_schemas right now. What am I missing in the setup steps to get this properly filled w/ csv files for the step above?

I'm re-following the guide, as my last attempt was unsuccessful. I've gotten this far, but I don't seem to have loaded anything into generated_schemas.

bcipolli commented 4 years ago

Having the same issue. I'm hacking things together as a workaround:

bcipolli commented 4 years ago

Offending commit is here: https://github.com/jsfenfen/990-xml-database/commit/0d0ccc9a1f9baba899b635a11b7673a09da04ca7

bcipolli commented 4 years ago

For better or for worse, my workaround is:

Once I do that, I'm able to follow the setup steps so far. I'm currently loading 2017 filings into a Postgres database.

bcipolli commented 4 years ago

@jsfenfen Looks like you checked in code for a version of IRSX that hasn't been released. Can you push that code / make that release? Or can I create a branch here that works around the issue?

See: https://github.com/jsfenfen/990-xml-reader/releases - there is no 0.2.8 release, as referenced in the commit I linked above.

rabsef-bicrym commented 4 years ago

@bcipolli I think that's worked for me, mate. Thank you. Incidentally, have you run submission year 2018 (returns from 2017) and gotten a lot of this:

keyerror: [{'schedule_name': 'IRS990', 'group_keyerrors': [], 'keyerrors': [{'element_path': '/IRS990/ContractorCompensationGrp/ContractorAddress'}]}]
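For reference, a record shaped like the one above can be flattened to just its failing xpaths with a small helper. This is only an illustrative sketch; the function name is hypothetical and not part of the project:

```python
def failing_xpaths(keyerror_records):
    """Collect element paths from keyerror records (hypothetical helper)."""
    paths = []
    for rec in keyerror_records:
        # both per-element and per-group errors carry an 'element_path'
        for err in rec.get("keyerrors", []):
            paths.append(err["element_path"])
        for err in rec.get("group_keyerrors", []):
            paths.append(err["element_path"])
    return paths
```

Running it over the record above would yield the single path /IRS990/ContractorCompensationGrp/ContractorAddress.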

bcipolli commented 4 years ago

@rabsef-bicrym I didn't, but I assume this is the kind of path blacklisting that the system was meant to take care of (it was implemented, but not merged). I believe adding /IRS990/ContractorCompensationGrp/ContractorAddress to emptyhead_blacklist.txt may avoid these errors, if needed. But that's just a hunch at this point.
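A blacklist file like the one described could, in principle, be consulted with something along these lines. The file name comes from the comment above; the helper names and the one-xpath-per-line format are assumptions, not the project's actual implementation:

```python
def load_blacklist(path):
    # Assumed format: one xpath per line; skip blanks and '#' comment lines.
    with open(path) as f:
        return {line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")}

def should_skip(xpath, blacklist):
    # Skip parsing for xpaths known to be empty heads.
    return xpath in blacklist
```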

rabsef-bicrym commented 4 years ago

Thank you - I really appreciate your insight and assistance.

bcipolli commented 4 years ago

No problem. I'm walking through the same issues right now as well. :)

bcipolli commented 4 years ago

This is my final result: [image]

This "missing" forms are ones that couldn't be parsed. I'm assuming that's due to the hackiness of the above approach, rather than getting the system to skip blacklisted xpaths. Hopefully @jsfenfen can upload the irsx updates, eliminate the need for this workaround, and get these errors to zero!

rabsef-bicrym commented 4 years ago

I'm not sure exactly how to check like you have above w/ the num_missing, but I got most of the 2018 submission year loaded successfully and I am STOKED. Thanks @bcipolli and @jsfenfen for everything.

jsfenfen commented 4 years ago

Hey, apologies for the delay. I should probably rewrite the documentation for that part and will try to take a look soon. I believe you don't have to run this step at all, though I may be misunderstanding.

To @bcipolli's point, I believe the files that are not being entered are in unrecognized versions; generally this means schemas from 2013 or before, but if you do any digging and find that's not the case, I'd love to know what's causing it.

The "keyerrors" generally represent incorrect xpaths, @bcipolli is on target that there was a list of emptyheads that get blacklisted, though their existence is an artifact of some less-than-perfect code upstream.

@bcipolli: I can push missing versions, there was one version I suppressed b/c it had a bug, I think.

bcipolli commented 4 years ago

Thanks @jsfenfen !

@rabsef-bicrym - here's the query I ran to get that result:

with data as (
select 
    filing.submission_year,
    --filing.return_type,
    filing.parse_complete,
    count(*) as num_filings,
    count(f1040.object_id) as num_1040,
    count(ez.object_id) as num_ez,
    count(pf.object_id) as num_pf
from public.filing_filing  filing 
left join return_part_0 f1040
    on f1040.object_id = filing.object_id
left join return_ez_part_0 ez 
    on ez.object_id = filing.object_id
left join return_pf_part_0 pf 
    on pf.object_id = filing.object_id
group by 
    filing.submission_year,
    --filing.return_type,
    filing.parse_complete
)
select 
    *, 
    num_filings - num_1040 - num_ez - num_pf as num_missing
from data 
where 
    num_filings != num_1040+num_ez+num_pf

Georg-coder commented 4 years ago

I have the same problem. There is only an empty blank.txt file in my generated_schemas. How did you guys solve it? I haven't understood where and how you changed code. @bcipolli: How do I change back to the old METADATA_DIRECTORY? Second, do I just create an empty text file, called emptyhead_blacklist.txt, in the file browser under the folder "METADATA_DIRECTORY"? I cannot find such a folder.

rabsef-bicrym commented 4 years ago

@Georg-coder The way I figured this out was by looking at what @bcipolli said was the offending commit.

METADATA_DIRECTORY = settings.METADATA_DIRECTORY
#METADATA_DIRECTORY = settings.METADATA_DIRECTORY
METADATA_DIRECTORY = settings.GENERATED_MODELS_DIR

You want to change the third line back to the first line. The offending commit changed the first line into the third, so reverting it is "resetting METADATA_DIRECTORY to the old value."

The directory into which you want to create the empty text file is 990*\irsdb\metadata

Let us know if that helps!

Georg-coder commented 4 years ago

@rabsef-bicrym Thanks, it worked! Now I'm running: nohup python manage.py load_filings 2015 &

so it can run in the background without a timeout, as described in the README. How do I know when it is done, and where do I eventually find/see the downloaded data? I have connected a PostgreSQL database to PyCharm (I called it "django"), which opens in my Chrome browser and shows some graphs, but I cannot see the actual data anywhere (screenshot attached).

[Screenshot 2020-01-30 at 10:08:59 AM]

rabsef-bicrym commented 4 years ago

@Georg-coder It's going to be hard for me to instruct on that part of things. I personally am pointing a BI tool called Tableau at the database to visualize the data in the ways that I want to. I also ran all of my loads in a screen session rather than using nohup, because I wanted to see whether they errored out, which they did on several occasions: sometimes requiring deletion (after backing up) of the XML file being loaded, and other times just requiring a restart. I think you'll need to think through how you want to use the data and then attempt to connect to it once activity has died down. Additionally, you can wait a significant period of time and rerun the command without nohup to see whether it is complete; the system is smart enough not to try to double-load the files.

Best of luck. Perhaps one of the others can provide better instruction.
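The "smart enough not to double-load" behavior described above amounts to filtering each run down to the filings not yet in the database. A minimal sketch of that idea, with a hypothetical helper name (not the project's actual code):

```python
def filter_unprocessed(object_ids, already_loaded):
    # Keep only filings whose object_id is not already recorded as loaded,
    # so re-running a load is a no-op for completed files.
    loaded = set(already_loaded)
    return [oid for oid in object_ids if oid not in loaded]
```

Under this model, rerunning the load command simply finds an empty (or shrinking) worklist once everything has been ingested.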

Georg-coder commented 4 years ago

@rabsef-bicrym I think your approach with Tableau is pretty nice and I would love to have that in the end as well. Is it working well with Tableau to look at the 990-data?

As you advised, I loaded in a screen session this time, and indeed there is a problem that keeps occurring:

... run_filing 201510199349200201
Traceback (most recent call last):
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (54, 'ECONNRESET')
During handling of the above exception, another exception occurred: .....

It might have to do with a few installations I made to resolve problems I had before. I also added the following to load_filings.py:

import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()
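As a general mitigation for transient ECONNRESET-style failures like the one above, the download call could be wrapped in a simple retry with backoff. This is only an illustrative sketch (the helper name and parameters are made up, not part of the project):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn(), retrying on transient connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            # Last attempt: re-raise so the caller sees the real failure.
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In practice one would catch the specific exception types the downloader raises (e.g. the requests/urllib3 connection errors) rather than the builtin ConnectionError used here for brevity.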

When the files before this one run, does that mean the 990 data (2015) is downloaded to my PC? I don't remember choosing/creating a location folder for it. I did connect a PostgreSQL database (showing in the upper right corner of the screenshot), but I don't really know how that works or what that is. I don't think I actually have server space there; instead it is just a web-browser interface for running SQL queries... (https://www.postgresqltutorial.com/load-postgresql-sample-database/)

[Screenshot 2020-01-30 at 6:31:44 PM] [Screenshot 2020-01-30 at 9:46:50 PM]

bcipolli commented 4 years ago

@Georg-coder Can you take this question to another issue, please? It will help others find the question (and answers) more easily.

Georg-coder commented 4 years ago

@rabsef-bicrym I managed to view the data in Tableau over the PostgreSQL server now as well. However, when loading the files, there is still an error terminating the download partway through. In case you have stumbled over this or know the solution, I would be very grateful; I have been working on this error for a few days: https://stackoverflow.com/questions/60046845/requests-exceptions-connectionerror-connection-aborted-remotedisconnected

effeappbw commented 2 years ago

Thanks for providing so much valuable insight @bcipolli. This metadata mystery is perplexing. I've been able to run the app, load filings, and even begin querying in SQL and R, but I don't think I'm pulling metadata correctly.

I keep getting the following error:

ModuleNotFoundError: No module named 'metadata'

I've retraced every step I can think of:

My question: do you recall this issue from your deployment? If so, I'd appreciate any feedback, if only a general direction to investigate.

My only hunch at this point is that my BASE_DIR is pointing to a venv site-packages path which isn't/can't be indexed correctly. (Please forgive me if I've bungled protocol/semantics here; this is only my 2nd month programming.)
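One way to narrow down a ModuleNotFoundError like the one above is to ask Python where (or whether) it can resolve the module at all, which also reveals whether the lookup is landing in the venv's site-packages. A small diagnostic sketch (the helper name is hypothetical):

```python
import importlib.util

def find_module_origin(name):
    # Return the file path Python would import `name` from,
    # or None if the module cannot be resolved on sys.path.
    spec = importlib.util.find_spec(name)
    return getattr(spec, "origin", None)
```

Running find_module_origin("metadata") from the project root shows whether the metadata package is visible; if it returns None, the working directory or sys.path likely doesn't include the directory containing it.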