gronlund / cvrdata

Extract data from the Danish CVR registry (Danish Business Authority)
MIT License

Long time for `update` and some missing rows in tables #8

Open d0nghyunkang opened 2 years ago

d0nghyunkang commented 2 years ago

Hi Allan,

First of all, thanks for publishing the package! I have run into two issues while using it.

  1. It takes quite a while to run python -m cvrparser update; it took about 100 hours to finish. The download speed is quite low, but I'm not sure why; when I run python -m cvrparser get_regs, it's much faster. To be concrete, 15 it/s versus 2000 it/s.
  2. The update command produces tables with some missing rows. I compared them with tables my colleague downloaded a few months ago. I suspect it is because of the error messages I see in the log files. I have attached all the log files for your reference (one consumer log file is particularly long, so I had to compress it).

My computing environment: Ubuntu 22.04, Python 3.9.10, MySQL 8.0.30, an 8-core CPU, and 47 GB of memory.

Thank you!

logs.zip

gronlund commented 2 years ago

Hi. The easy solution is not to use the update functionality and instead insert from scratch, unless only a very short time has passed since the last update. The update is very slow because it has to check whether a company has information that has been updated, and if so it deletes all information for that company and reinserts it. So if many companies need to be updated this is very inefficient. At the time of writing the code it seemed the only robust approach, for many reasons.

Another issue is/was the employee count tables, which at the time could be updated in the database without changing the data version information in the data, so it was not possible to detect those updates easily, and hence that part is not perfect. I do not think I ever fixed that, and the semantics of how to detect updates may have changed in the data, so the update may in fact not give the same data as a fresh insert.

If a long time has passed since the last update, using update is very slow; clearing the tables and starting from scratch is much faster. Maybe I should write that somewhere, but I have not really worked on the project for years. Also note that existing rows may be deleted at a later time by DBA, perhaps due to errors or other reasons, and hence inconsistencies may occur, but I think this is quite rare.
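
For illustration, a minimal sketch (plain Python with toy dicts, not the package's actual code) of the delete-and-reinsert behaviour described above:

from datetime import datetime

# Minimal sketch of the delete-and-reinsert update strategy described above;
# the data layout and helper are illustrative, not the package's real code.
def update_companies(db_rows, downloaded, last_known_update):
    # db_rows maps enhedsnummer -> list of rows; downloaded is a list of company dicts
    for company in downloaded:
        if company["sidstopdateret"] <= last_known_update:
            continue  # unchanged since the last run, nothing to do
        enh = company["enhedsnummer"]
        # Any change means every existing row for the company is thrown away ...
        db_rows.pop(enh, None)
        # ... and the full record is inserted again, so a database that is far
        # behind ends up deleting and reinserting almost everything.
        db_rows[enh] = company["rows"]
    return db_rows

# Example: one company changed since the last update, so its rows are rebuilt.
rows = {9102: [("NORMAL",)]}
changed = [{"enhedsnummer": 9102, "sidstopdateret": datetime(2022, 9, 1),
            "rows": [("NORMAL",), ("OPLØST",)]}]
update_companies(rows, changed, last_known_update=datetime(2022, 1, 1))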

Does this make sense? -Allan

d0nghyunkang commented 2 years ago

Thanks a lot for the explanation. I'll try with get_regs from the beginning. I'll come back after I have the results!

gronlund commented 2 years ago

Sorry, I missed a part of your question. get_regs is a newer feature I added upon request, to extract data from a new data source in the DBA database. I think I only made it insert everything from scratch, but I would have to check the code to see whether it can actually continue from where it is; that may be possible. More importantly, get_regs only inserts registration data, from that (at the time new) source, into a single table, and it does not insert the entire CVR information into the database, making update and get_regs completely incomparable: update inserts the full CVR data (excluding the registrations).

Hope that makes sense as well :)

d0nghyunkang commented 2 years ago

Thanks, I get it. Indeed, when I tried get_regs before, it only returned the Registrations table, but nothing else.

I think I've always been inserting from scratch, since I work on a new machine (via cloud computing) for each download attempt. Despite this, I've always had the same problems -- slow download speed and missing rows.

My order of execution is like this:

python -m cvrparser dbsetup -t # This creates tables
python -m cvrparser reconfigure # Sets up access 
python -m cvrparser update # Download tables

gronlund commented 2 years ago

I would run reconfigure before dbsetup, but as long as you do not change the database name it should be fine, I guess :). I just tested the following on my laptop (the CREATE SCHEMA statement in MySQL Workbench, the rest on the command line):

CREATE SCHEMA dummy2 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin ;
conda create -n cvrtest python=3.8
pip install git+https://github.com/gronlund/cvrdata
pip install bs4
python -m cvrparser reconfigure
python -m cvrparser dbsetup -t
python -m cvrparser update

It inserts about 1500-2500 companies per second initially and then slows down as the database fills with entries, but it should finish in reasonable time. I will try to let it finish, but it seems to be working. The missing rows I am not sure about.

I can see in the log file that a data entry seems to be too long for the corresponding SQL field. Maybe I should just make the fields very long; I think I can fix that. Let me take a look.

gronlund commented 2 years ago

I found the issue. Some company had merged 15 different website names into one name, making a very long string which did not fit in the database. I made the table column accept longer strings, as I do not try to fix raw data even if it is not meaningful.
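
For context, a hedged sketch of what such a change could look like in a SQLAlchemy model; the class, table and column names and the lengths are placeholders, not the package's actual schema:

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Hjemmeside(Base):
    __tablename__ = "hjemmesider_demo"   # placeholder table name
    id = Column(Integer, primary_key=True)
    # was e.g. String(255); widened so a merged 15-website string still fits
    hjemmeside = Column(String(2000))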

Hope it works and remember to create the tables again.

d0nghyunkang commented 2 years ago

I see, thanks a lot! I'll try downloading it now.

d0nghyunkang commented 2 years ago

Just to update you in the meantime: I've been downloading the tables, but it's still taking a long time. It's been 24 hours, and the speed is now 10-20 it/s. I'll try to finish the download (it may take 1-2 more nights) and compare the number of rows again!

gronlund commented 2 years ago

It should only take an hour or two to fill the database, so something is wrong. The system is built as a producer-consumer: the producer downloads from DBA and puts data on a queue, and workers take from the queue and insert into MySQL. The it/s counter only shows how many companies the producer is downloading per second, which should be in the thousands. So maybe your connection to DBA is being throttled/blocked somehow; it should be much, much faster. There is also a possibility that the workers are too slow and the producer then has to wait, as their shared queue has a maximum size, which would reduce download speed. In that case the producer should log timeout errors, which I do not see in the files you sent. Does the counter start high and then drop, or what is happening?
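
To make the bottleneck concrete, here is a minimal toy sketch of that bounded-queue producer/consumer pattern (illustrative only, not the package's implementation):

import queue
import threading
import time

# The producer downloads companies and puts them on a bounded queue; the worker
# takes them off and stands in for the MySQL insert. If the worker falls
# behind, the queue fills up and the producer blocks or times out, which is
# what a falling it/s counter would reflect.
q = queue.Queue(maxsize=100)

def producer(n_companies):
    for i in range(n_companies):
        try:
            q.put(f"company-{i}", timeout=5)  # blocks while the queue is full
        except queue.Full:
            print("queue full for 5s - the workers cannot keep up")
    q.put(None)  # sentinel telling the worker to stop

def worker():
    while (item := q.get()) is not None:
        time.sleep(0.001)  # stand-in for the MySQL insert done by a real worker
        q.task_done()

t = threading.Thread(target=worker)
t.start()
producer(1000)
t.join()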

d0nghyunkang commented 2 years ago

I understand, thanks! The speed drops quite dramatically. It starts fast in the first minute or so, with more than 1000 companies added. Soon after it drops to the hundreds, and within an hour or more it gets below a hundred.

gronlund commented 2 years ago

Then my guess is that the problem is with downloading large amounts of data from elasticsearch over https, which is what the program does, for some reason. Best guess is security/firewall/...

d0nghyunkang commented 2 years ago

Sorry for the late reply. I was away for business travel...

I tried increasing the computing power (number of CPUs and RAM), but to no avail -- the speed still drops quickly. I'm less worried about the speed than the number of missing rows, as I can tolerate the slow speed.

I compared a sample of firms with the DBA web interface (at https://datacvr.virk.dk/), and I can see my Updates table is missing many pieces of information compared to the web version. I wonder if the following errors (?) in the log files are causing the missing rows. What do you think about these messages in the consumer log files? There are many lines like:

(MySQLdb.IntegrityError) (1062, "Duplicate entry 'ST Holdings S.à r.l.' for key 'Navne.navn'")
[SQL: INSERT INTO `Navne` (navn) VALUES (%s)]
[parameters: (('PPD SCANDINAVIA AB',), ('Richard Andreas Budeng',), ('Leonardo André Cuscona',), ('Online Brands LLC',), ('Marc Dilly',), ('Marja Kristina Renvall',), ('Bjørn Bo Kristoffer Forsen',), ('Helder  Boavida',)  ... displaying 10 of 61 total bound parameter sets ...  ('Christian Harald Bang-Melchior',), ('Antje Siem',))]
(Background on this error at: https://sqlalche.me/e/14/gkpj) - attempt 0

Thanks.

d0nghyunkang commented 2 years ago

I also see these messages on the command line during update. I've tried searching a bit to understand what they mean, but I couldn't figure it out. They do contain the message about "Lock wait timeout exceeded":

273555it [55:13, 14.49it/s]Exception in thread Thread-34087:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
    self.dialect.do_execute(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/cvrparser/elastic_cvr_extract.py", line 340, in worker
    session.query(table_class).filter(table_class.enhedsnummer.in_(enh)).delete(synchronize_session=False)
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3191, in delete
    result = self.session.execute(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1692, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1620, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1487, in _execute_clauseelement
    ret = self._execute_context(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1851, in _execute_context
    self._handle_dbapi_exception(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2032, in _handle_dbapi_exception
    util.raise_(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
    self.dialect.do_execute(
  File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
    db.query(q)
  File "/opt/conda/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: DELETE FROM `Updates` WHERE `Updates`.enhedsnummer IN (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: (4003754700, 4002079790, 4002548705, 4002079806, 4002547988, 4002629843, 4002548015, 4007286793, 4007388056, 4002549715, 4006250208, 4002549382, 4002079080, 4002549405, 4002629463, 4002549412, 4003717878, 4002549418, 4007383422, 4002080169, 4002555808, 4002080446, 4002555828, 4002555868, 4002074539, 4006373123, 4002556515, 4003587500, 4002556542, 4002556562, 4006445970, 4002556568, 4007030638, 4002556182, 4007451038, 4002556195, 4003583410, 4002556225, 4002313535, 4002520488, 4002366423, 4002520772, 4006844434, 4002366436, 4002520815, 4006357927, 4002520835, 4003586206, 4002521148, 4003570224, 4003586239, 4002521168, 4006725643, 4002079019, 4002521175, 4003832679, 4003582026, 4002521188, 4006961747, 4002079783, 4002521492, 4006626381, 4002521548, 4002629823)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

gronlund commented 2 years ago

The duplicate key errors occur because the different workers use the database at the same time, and one may insert a unique value while another worker is handling a company with the same unique value. This creates some issues which I check for, and it leads to errors that are then fixed. This was the price I paid for easy parallelism to speed things up. If you reduce the worker count to 1 this should not happen. I think there is an option for that, but I have to check.
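
For illustration, a hedged, self-contained sketch (SQLite in memory, not the package's actual schema or error handling) of how such a duplicate-key race can be absorbed with a get-or-create pattern:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Navn(Base):
    __tablename__ = "navne_demo"                  # placeholder, not the real Navne table
    id = Column(Integer, primary_key=True)
    navn = Column(String(200), unique=True)       # the unique value workers race on

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

def get_or_create_navn(session, value):
    existing = session.query(Navn).filter_by(navn=value).one_or_none()
    if existing is not None:
        return existing
    try:
        session.add(Navn(navn=value))
        session.commit()
    except IntegrityError:
        # Another worker inserted the same unique value first: roll back our
        # attempt and reuse the row that won the race.
        session.rollback()
    return session.query(Navn).filter_by(navn=value).one()

with Session(engine) as session:
    first = get_or_create_navn(session, "ST Holdings S.à r.l.")
    second = get_or_create_navn(session, "ST Holdings S.à r.l.")
    assert first.id == second.id                  # the duplicate resolves to the same row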

The slowness is not because of CPU; I think something is throttling your download speed, since the progress counter only shows the number of downloaded companies. As I said, if the MySQL database has issues then at some point the queue fills up and the download slows down. I can try to see if there is an easy way to detect this later this week.

Can you send some CVR numbers of companies with missing rows? Then I can check whether I have any trouble with them.

gronlund commented 2 years ago

I added an option to set the number of workers: python -m cvrparser update -n 1 should do the trick and run with one worker. Then there should be no errors like this, I think. If there is a timeout in the producer because the queue is full, the producer log should show a timeout error.

d0nghyunkang commented 2 years ago

Thanks! I'm trying with python -m cvrparser update -n 1. I don't see similar errors in the log files anymore. I still have the problem with the speed slowing down; I'll let this download finish and compare the results.

For the list of CVRs with missing information, I'll collect them and share with you in the coming days.

d0nghyunkang commented 2 years ago

compare old and new.zip

I'm sharing some tables from the old and new versions that I have. They are not about specific CVR numbers, but other tables that grabbed my attention when I first compared them. They have differences that were easy to notice.

The attached file has two versions of the tables. The first version, labeled 'old', is from my colleague, and the tables there contain more rows. The second version, labeled 'new', is from my recent download, and those tables have some missing rows.

(The two versions have a few other differences, which are formatting issues, such as the delimiter character or the quoting of values.)

In particular, if you look at virksomhedsstatus.csv, the two versions (old vs. new) have two differences:

  1. the coding of a status is different. For instance, in the old version, 22;"AKTIV" vs. new version: "36","AKTIV"
  2. the new version has one row fewer. It does not have the row 18;"Tvangsopløst".

If you look at virksomhedsform.csv, the difference is that the new version does not have the row 245;"REG";"Region";"DST".

I will add actual CVRs of firms later! Hope this helps in the meantime.

gronlund commented 2 years ago

The code numbers, for instance 22, "Aktiv", are unique ids automatically generated by the database; they have no meaning in themselves and are only relevant for joining with the update table to get the full information. When you download from scratch, new ids are issued depending on the order in which values are inserted. In the meantime, someone at DBA may have cleaned the database and, for instance, removed the value "Tvangsopløst" and made it into "TVANGSOPLØST" or something like that, and hence that name would be gone. The virksomhedsstatus table just shows the unique strings seen in the data for that piece of information (I am pretty sure that differences in upper and lower case make the strings distinct in the system). The same goes for virksomhedsform. Another possibility is that an error was made; the best way to determine this is to look in the old database, find a company that has 18;"Tvangsopløst" as a state, and then check that company to see if there is a bug.
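
As a hedged illustration of why those ids move around (SQLite in memory, toy schema, not the package's actual tables), inserting the same status strings in a different order gives them different auto-generated code numbers:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Status(Base):
    __tablename__ = "virksomhedsstatus_demo"       # placeholder table
    kode = Column(Integer, primary_key=True)       # auto-generated id
    virksomhedsstatus = Column(String(100), unique=True)

def fill(order):
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add_all(Status(virksomhedsstatus=v) for v in order)
        session.commit()
        return {s.virksomhedsstatus: s.kode for s in session.query(Status)}

print(fill(["AKTIV", "NORMAL", "TVANGSOPLØST"]))   # AKTIV gets kode 1
print(fill(["NORMAL", "TVANGSOPLØST", "AKTIV"]))   # the same AKTIV gets kode 3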

gronlund commented 2 years ago

I just filled the table from scratch, and in my virksomhedsform table, with columns

virksomhedsformkode, kortbeskrivelse, langbeskrivelse, ansvarligdataleverandoer

the row '235', 'SOV', 'Selvstændig offentlig virksomhed', 'E&S' exists, so I do not know why it is not in yours. Here 235 is a code number from DBA. For virksomhedsstatus I get 22 rows:

'1','NORMAL'
'2','UNDER TVANGSOPLØSNING'
'3','TVANGSOPLØST'
'4','UNDER FRIVILLIG LIKVIDATION'
'5','OPLØST EFTER FRIVILLIG LIKVIDATION'
'6','UNDER KONKURS'
'7','OPLØST EFTER KONKURS'
'8','OPLØST EFTER SPALTNING'
'9','OPLØST EFTER ERKLÆRING'
'10','OPLØST EFTER FUSION'
'11','UDEN RETSVIRKNING'
'12','SLETTET'
'13','UNDER REASSUMERING'
'14','UNDER REKONSTRUKTION'
'15','slettet'
'16','Slettet'
'17','OPLØST EFTER GRÆNSEOVERSKRIDENDE HJEMSTEDSFLYTNING'
'18','OPLØST'
'19','Under reassumering'
'20','OPLØST EFTER GRÆNSEOVERSKRIDENDE FUSION'
'21','UNDER REASUMMERING'
'22','AKTIV'

d0nghyunkang commented 1 year ago

Hi, sorry for the late response (I got a little sick...)

comparison_forshare.xlsx

I'm sharing a list of CVRs with missing information; in other words, rows that are found only in the old Updates table but not in the new one. To focus on the differences, I look at a subset of the Updates table, namely status updates. To be specific, I filter on felttype=="status" OR felttype=="virksomhedsstatus".
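
Roughly, the comparison I do looks like this (file names, separators and column names here are placeholders for my actual files, not the ones attached):

import pandas as pd

old = pd.read_csv("updates_old.csv", sep=";")      # old download from my colleague
new = pd.read_csv("updates_new.csv", sep=",")      # my recent download

keep = ["enhedsnummer", "felttype", "sidstopdateret"]

def status_only(df):
    # keep only the status-related updates, as described above
    return df[df["felttype"].isin(["status", "virksomhedsstatus"])][keep]

merged = status_only(old).merge(status_only(new), how="left", on=keep, indicator=True)
only_in_old = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(len(only_in_old), "status rows appear only in the old download")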

This list is not exhaustive -- I compared old and new tables manually from the top of the table, which was ascending-sorted by enhedsnummer. So, there are many more rows which are found only in the old version, but not the new version.

I manually compared about 300 rows between the old and new tables. My sense is that there is no case where the new table has rows that are not found in the old table. In other words, the new table is strictly a subset of the old table.

Also, I noticed that when an enhedsnummer is missing from the new table, all of its status updates are missing. For example, in my file, all 3 rows with enhedsnummer==9102 are missing from the new table, rather than only some of them.

I did some sanity checks myself by looking up these missing rows in the web CVR (https://datacvr.virk.dk/). I could not find any particular pattern that would explain the omissions in the new dataset.

d0nghyunkang commented 1 year ago

I updated the list to be exhaustive -- the shared file now contains all firms that are missing in the new version. (Please note that the old version was downloaded around 2021-12-12; that's why the maximum value of sidstopdateret is that date.)

only_in_old.zip

gronlund commented 1 year ago

Hi again. Apologies for the long wait; I have been very busy and completely forgot. I checked a few random entries from your list of missing rows and they are not missing in my table, which I recently updated.

SELECT * FROM more_bleeding.virk_virksomhedsstatus where enhedsnummer in (6, 947392, 672078)

6 57981717 1 NORMAL 1976-07-01 00:00:00 2200-01-01 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1975-12-30 00:00:00 1991-04-17 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1992-06-23 00:00:00 2200-01-01 00:00:00 2015-03-03 01:00:00
672078 36940816 2 UNDER TVANGSOPLØSNING 1991-04-18 00:00:00 1992-06-22 00:00:00 2015-03-03 01:00:00
947392 13411581 1 NORMAL 1989-08-01 00:00:00 2018-08-15 00:00:00 2018-09-27 16:13:26
947392 13411581 10 OPLØST EFTER FUSION 2018-08-16 00:00:00 2018-08-16 00:00:00 2018-09-27 16:13:28

which matches your csv file.

I will try completely emptying my table and retry.

gronlund commented 1 year ago

The new test is done and the result is the same.

enhedsnummer cvrnummer virksomhedsstatuskode virksomhedsstatus gyldigfra gyldigtil sidstopdateret

6 57981717 1 NORMAL 1976-07-01 00:00:00 2200-01-01 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1975-12-30 00:00:00 1991-04-17 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1992-06-23 00:00:00 2200-01-01 00:00:00 2015-03-03 01:00:00
672078 36940816 3 UNDER TVANGSOPLØSNING 1991-04-18 00:00:00 1992-06-22 00:00:00 2015-03-03 01:00:00
947392 13411581 1 NORMAL 1989-08-01 00:00:00 2018-08-15 00:00:00 2018-09-27 16:13:26
947392 13411581 9 OPLØST EFTER FUSION 2018-08-16 00:00:00 2018-08-16 00:00:00 2018-09-27 16:13:28

So it is not clear where the error is yet.

gronlund commented 1 year ago

It seems the package install manager installs the wrong elasticsearch client. Install a 6.* version instead of 7. I will try to update the install script at some point to make it work as intended.

d0nghyunkang commented 1 year ago

I see. Hope this solves the issue :) Thanks a lot!

d0nghyunkang commented 1 year ago

Hi Allan, could you kindly help me with the issue? I've tried installing the elasticsearch package separately before installing cvrparser, but it didn't work out.

gronlund commented 1 year ago

I would guess you need to install it afterwards, so that the cvrparser installation does not overwrite it. If I find some time (which is not guaranteed) I will try to update the install script, which seems to pick the wrong version. It seems to be specified correctly in the file I use, but clearly it is not, since it does not work.
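
For example (assuming pip and the 6.* series mentioned earlier, which I have not re-verified): run pip install git+https://github.com/gronlund/cvrdata first, and then pip install "elasticsearch>=6.0.0,<7.0.0" afterwards, so the 6.x client is the one that ends up installed.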