rabsef-bicrym opened 4 years ago
Having the same issue. I'm hacking things together, as:

- `generated_schemas/emptyhead_blacklist.txt` doesn't exist anywhere
- `METADATA_DIRECTORY` has been changed in `load_metadata.py` without any explanation of why

The offending commit is here: https://github.com/jsfenfen/990-xml-database/commit/0d0ccc9a1f9baba899b635a11b7673a09da04ca7

For better or for worse, my workaround is to reset `METADATA_DIRECTORY` to its old value and create an empty `emptyhead_blacklist.txt` in that directory. Once I do that, I'm so far able to follow the setup steps. I'm currently loading 2017 filings into a Postgres database.
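A minimal sketch of that workaround, assuming the metadata directory sits at `irsdb/metadata` in your checkout (the exact location comes up again later in this thread; adjust to your layout):

```python
# Hedged sketch of the workaround above, not the project's official fix.
# The path is an assumption based on a default checkout of 990-xml-database.
from pathlib import Path

metadata_dir = Path("irsdb/metadata")                # what METADATA_DIRECTORY should point at
blacklist = metadata_dir / "emptyhead_blacklist.txt"
blacklist.touch(exist_ok=True)                       # create the empty blacklist file the loader expects
```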
@jsfenfen Looks like you checked in code for a version of IRSX that hasn't been released. Can you push that code / make that release? Or can I create a branch here that works around the issue?
See https://github.com/jsfenfen/990-xml-reader/releases - there is no 0.2.8 release, as referenced in the commit I linked above.
@bcipolli I think that's worked for me, mate. Thank you. Incidentally, have you run submission year 2018 (returns from 2017) and gotten a lot of this?

```
keyerror: [{'schedule_name': 'IRS990', 'group_keyerrors': [], 'keyerrors': [{'element_path': '/IRS990/ContractorCompensationGrp/ContractorAddress'}]}]
```
@rabsef-bicrym I didn't, but I assume this is the kind of path blacklisting that the system (which was implemented, but not merged) was meant to take care of. I believe adding `/IRS990/ContractorCompensationGrp/ContractorAddress` to `emptyhead_blacklist.txt` may avoid these errors, if needed. But, just a hunch at this point.
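If it helps, that guess as a one-liner (the file location is an assumption; see the directory discussion further down the thread):

```python
# Hedged sketch: append the failing xpath to the blacklist, one path per line.
# The file path is an assumption based on the METADATA_DIRECTORY discussion below.
with open("irsdb/metadata/emptyhead_blacklist.txt", "a") as f:
    f.write("/IRS990/ContractorCompensationGrp/ContractorAddress\n")
```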
Thank you - I really appreciate your insight and assistance.
No problem. I'm walking through the same issues right now as well. :)
This is my final result (table of query results omitted):

These "missing" forms are ones that couldn't be parsed. I'm assuming that's due to the hackiness of the above approach, rather than the system skipping blacklisted xpaths. Hopefully @jsfenfen can upload the irsx updates, eliminate the need for this workaround, and get these errors to zero!
I'm not sure exactly how to check like you have above w/ the num_missing, but I got most of the 2018 submission year loaded successfully and I am STOKED. Thanks @bcipolli and @jsfenfen for everything.
Hey, apologies for the delay. I should probably rewrite the documentation for that part; I'll try to take a look soon. I believe you don't have to run this step at all, though I may be misunderstanding.
To @bcipolli's point, I believe the files that are not being entered are in unrecognized versions; generally this means schemas from 2013 or before, but if you do any digging and find that not to be the case, I'd love to know what's causing it.
The "keyerrors" generally represent incorrect xpaths. @bcipolli is on target that there was a list of emptyheads that get blacklisted, though their existence is an artifact of some less-than-perfect code upstream.
@bcipolli: I can push the missing versions; there was one version I suppressed b/c it had a bug, I think.
Thanks @jsfenfen !
@rabsef-bicrym - here's the query I ran to get that result:
```sql
-- Per-year counts of filings vs. parsed 990 / 990-EZ / 990-PF rows
with data as (
    select
        filing.submission_year,
        --filing.return_type,
        filing.parse_complete,
        count(*) as num_filings,
        count(f1040.object_id) as num_1040,
        count(ez.object_id) as num_ez,
        count(pf.object_id) as num_pf
    from public.filing_filing filing
    left join return_part_0 f1040
        on f1040.object_id = filing.object_id
    left join return_ez_part_0 ez
        on ez.object_id = filing.object_id
    left join return_pf_part_0 pf
        on pf.object_id = filing.object_id
    group by
        filing.submission_year,
        --filing.return_type,
        filing.parse_complete
)
-- show only groups where some filings have no parsed return row
select
    *,
    num_filings - num_1040 - num_ez - num_pf as num_missing
from data
where
    num_filings != num_1040 + num_ez + num_pf
```
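If you'd rather run that check without a separate SQL client, a rough equivalent from the Django shell (the `check_missing.sql` filename is mine; save the query above under any name):

```python
# Hedged sketch: run the diagnostic query through Django's DB connection.
# Assumes this is executed inside `python manage.py shell` at the project root.
from django.db import connection

with open("check_missing.sql") as f:   # the query above, saved locally
    sql = f.read()

with connection.cursor() as cur:
    cur.execute(sql)
    for row in cur.fetchall():
        print(row)
```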
I have the same problem. There is only an empty blank.txt file in my generated_schemas. How did you guys solve it? I haven't understood where and how you changed the code. @bcipolli: How do I change back to the old METADATA_DIRECTORY? Secondly, do I just create an empty text file called emptyhead_blacklist.txt in the file browser under the folder "METADATA_DIRECTORY"? --> I cannot find such a folder...
@Georg-coder The way I figured this out was by looking at what @bcipolli said was the offending commit:

```python
METADATA_DIRECTORY = settings.METADATA_DIRECTORY
#METADATA_DIRECTORY = settings.METADATA_DIRECTORY
METADATA_DIRECTORY = settings.GENERATED_MODELS_DIR
```

You want to change the third line back to the first line: the commit changed the first line into the third line, so undoing it is "resetting METADATA_DIRECTORY to the old value".
The directory in which to create the empty text file is 990*\irsdb\metadata
Let us know if that helps!
@rabsef-bicrym Thanks, it worked! Now I'm running:

```
nohup python manage.py load_filings 2015 &
```

so it can run in the background without a timeout, as described in the README. How do I know when it is done, and where do I eventually find/see the downloaded data? --> I have connected a postgresql database to PyCharm (I called it "django"), which opens in my Chrome browser and shows some graphs, but I cannot see the actual data anywhere (screenshot attached).
@Georg-coder It's going to be hard for me to instruct on that part of things. I personally am pointing a BI tool called Tableau at the database to visualize the data in the ways that I want to.

I also ran all of my loads in a screen session rather than using nohup, because I wanted to see if they errored out, which they did on several occasions: sometimes requiring deletion (after backing up) of the XML file that was loading, and other times just requiring a restart.

I think you'll need to think through how you want to use the data and then attempt to connect to it once activity has died down. Additionally, you can wait a significant period of time and re-run the command without nohup to see whether it is complete; the system is smart enough not to try to double-load the files.
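For what it's worth, a rough way to check progress from the Django shell; the model and field names below are my guesses inferred from the `filing_filing` table used in the query above, so treat this as a sketch:

```python
# Hedged sketch: count parsed vs. total filings for one submission year.
# `filing.models.Filing` is an assumption inferred from the filing_filing table.
from filing.models import Filing

total = Filing.objects.filter(submission_year=2015).count()
done = Filing.objects.filter(submission_year=2015, parse_complete=True).count()
print(f"{done}/{total} filings marked parse_complete")
```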
Best of luck. Perhaps one of the others can provide better instruction.
@rabsef-bicrym I think your approach with Tableau is pretty nice, and I would love to have that in the end as well. Does Tableau work well for looking at the 990 data?
As you advised, I loaded in screen this time, and indeed there is a problem that keeps occurring:

```
... run_filing 201510199349200201
Traceback (most recent call last):
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 485, in wrap_socket
    cnx.do_handshake()
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1934, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "/Users/Georg/PycharmProjects/990-xml-database/venv/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1663, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (54, 'ECONNRESET')

During handling of the above exception, another exception occurred: .....
```

--> It might have to do with a few packages I installed to resolve problems I had before. I also added this to load_filings.py:

```python
import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()
```

When the earlier files ran through, does that mean the 990 data (2015) was downloaded to my PC? I don't remember choosing/creating a location folder for it. I did connect a postgresql database (showing in the upper right corner of the screenshot) but don't really know how that works / what that is. I don't think I actually have server space there; instead it seems to be just a web-browser interface for SQL queries... (https://www.postgresqltutorial.com/load-postgresql-sample-database/)
@Georg-coder Can you take this question to another issue, please? It will help others find the question (and answers) more easily.
@rabsef-bicrym I managed to view the data in Tableau over the postgresql server now as well. However, when loading the files, there is still an error that terminates the download partway through. In case you have stumbled over this or know the solution, I would be very grateful. I have been working on this error for a few days: https://stackoverflow.com/questions/60046845/requests-exceptions-connectionerror-connection-aborted-remotedisconnected
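A generic mitigation for transient ECONNRESET / RemoteDisconnected failures is retrying the HTTP requests with backoff. A sketch using requests; whether and where to hook this into load_filings is an assumption on my part, not something the project documents:

```python
# Hedged sketch: a requests session that retries transient connection errors
# with exponential backoff instead of dying on the first reset.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1,              # up to 5 retries: 1s, 2s, 4s, ...
                status_forcelist=[500, 502, 503, 504])  # also retry these HTTP statuses
session.mount("https://", HTTPAdapter(max_retries=retries))

# e.g. session.get(xml_url) in place of requests.get(xml_url)
```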
Thanks for providing so much valuable insight @bcipolli. This metadata mystery is perplexing. I've been able to run the app, load filings, and even begin querying in SQL & R, but I don't think I'm pulling metadata correctly.
I keep getting the following error:

```
ModuleNotFoundError: No module named 'metadata'
```

I've retraced every step I can think of.
My question: do you recall this issue during your deployment? If so, I'd appreciate any feedback, if only a general direction to investigate.
My only hunch at this point is that my BASE_DIR is pointing to a venv site-packages path which isn't/can't be indexed correctly. (Please forgive me if I bungled protocol/semantics here; this is only my 2nd month programming.)
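A quick, generic way to test that hunch (nothing project-specific here; run it from the directory containing manage.py, which I'm assuming holds the metadata package):

```python
# Hedged sketch: check whether Python can locate the `metadata` package at all.
import importlib.util

spec = importlib.util.find_spec("metadata")
print(spec.origin if spec else "not found; check your working directory / sys.path")
```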
I seem to only have blank.txt in generated_schemas right now. What am I missing in the setup steps to get this properly filled with csv files for the step above?
I'm re-following the guide, as my last attempt was unsuccessful. I've gotten this far, but I don't seem to have loaded anything into generated_schemas.