d0nghyunkang opened 2 years ago
Hi. The easy solution is to skip the update functionality and instead insert from scratch, unless only a very short time has passed since the last update. The update is slow because it has to check whether each company has information that has been updated; if so, it deletes all information for that company and reinserts it. When many companies need to be updated this is very inefficient, but at the time of writing it seemed the only robust approach, for several reasons.

Another issue is/was the employee count tables, which at the time could be updated in the database without the data version information changing, so there was no easy way to detect updates, and hence that part is not perfect. I do not think I ever fixed that, and the semantics of how to detect updates may have changed in the data. So an update may in fact not give the same data as a fresh insert.

If a long time has passed since the last update, using update is very slow; clearing the tables and starting from scratch is much faster. Maybe I should document that somewhere, but I have not really worked on the project for years. Also note that existing rows may be deleted later by DBA, perhaps due to errors or other reasons, so inconsistencies may occur, though I think this is quite rare.
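The update strategy described above can be sketched roughly like this (a toy illustration with made-up class and method names, not the actual cvrparser code):

```python
# Toy illustration of why "update" is slow: every company whose data
# version changed triggers a full delete + reinsert of all its rows.
class ToyDB:
    def __init__(self):
        self.versions = {}   # cvr -> last seen data version
        self.rows = {}       # cvr -> list of stored rows
        self.deletes = 0     # count of expensive delete passes

    def upsert(self, cvr, version, rows):
        if self.versions.get(cvr) == version:
            return False                 # unchanged: cheap skip
        if cvr in self.rows:             # changed: wipe everything first
            self.deletes += 1
            del self.rows[cvr]
        self.versions[cvr] = version     # then reinsert from scratch
        self.rows[cvr] = list(rows)
        return True

db = ToyDB()
db.upsert(1, "v1", ["a"])
db.upsert(1, "v1", ["a"])                  # same version: no work done
changed = db.upsert(1, "v2", ["a", "b"])   # new version: delete + reinsert
```

When many companies have changed, nearly every call takes the expensive delete-and-reinsert path, which is why a fresh insert into empty tables ends up faster.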
Does this make sense? -Allan
Thanks a lot for the explanation. I'll try with get_regs from the beginning. I'll come back after I have the results!
Sorry, I missed a part of your question. get_regs is a new feature I added upon request not that long ago, to extract data from a new data source in the DBA database. I think I only made it insert everything from scratch, but I would have to check the code to see whether it can actually continue from where it is; that may be possible. More importantly, get_regs only inserts registration data, from the (at the time) new data source, into a table. It does not insert the entire CVR information into the database, which makes update and get_regs completely incomparable: update inserts the full CVR data (excluding the registrations).
Hope that makes sense as well :)
Thanks, I get it. Indeed, when I tried get_regs before, it only returned the Registrations table, but nothing else.
I think I have always been inserting from scratch, since I work on a new machine (via cloud computing) at each download attempt. Despite this, I have always had the same downloading problems -- slow speed and missing rows.
My order of execution is like this:
python -m cvrparser dbsetup -t # This creates tables
python -m cvrparser reconfigure # Sets up access
python -m cvrparser update # Download tables
I would run reconfigure before dbsetup, but as long as you do not change the database name it should be fine, I guess :).
I just tested the following on my laptop. First, in MySQL Workbench:
CREATE SCHEMA dummy2 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
Then:
conda create -n cvrtest python=3.8
pip install git+https://github.com/gronlund/cvrdata
pip install bs4
python -m cvrparser reconfigure
python -m cvrparser dbsetup -t
python -m cvrparser update
And it inserts about 1500-2500 companies per second initially, slowing down as the database fills with entries, but it should finish in reasonable time. I will try to let it finish, but it seems to be working. The missing rows I am not sure about.
I can see in the log file that a data entry seems to be too long for the corresponding SQL field. Maybe I should just make them all very long; I think I can fix that. Let me take a look.
I found the issue. Some company had merged 15 different website names into one name, creating a very long string which did not fit in the database. I made the table column accept longer strings, as I do not try to fix raw data even if it is not meaningful.
Hope it works, and remember to create the tables again.
I see, thanks a lot! I'll try downloading it now.
Just to update you in the meanwhile: I've been downloading the tables, but it's still taking a long time. It's been 24 hours, and the speed is now 10-20 it/s. I'll try to finish the download (it may take 1-2 more nights) and compare the number of rows again!
It should only take an hour or two to fill the database, so something is wrong. The system is built as a producer/consumer: the producer downloads from DBA and inserts data into a queue, and workers extract from the queue and insert into MySQL. The it counter only shows how many companies per second the producer is downloading, which should be in the thousands. So maybe your connection to DBA is being throttled or blocked somehow; it should be much, much faster. There is a possibility that the workers are too slow and the producer then has to wait, as their shared queue has a maximal size, which would reduce download speed. In that case the producer should produce timeout errors in the log, which I do not see in the files you sent. Does the counter start high and then drop, or what is happening?
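The producer/consumer layout described above can be sketched like this (a minimal illustration with assumed names; the real cvrparser code differs in detail):

```python
import queue
import threading

# One producer downloads companies and puts them on a bounded queue;
# a worker pops items and inserts them into the database. When the
# worker is slow, the queue fills up and put() blocks, so the producer's
# it/s counter drops even though the download itself is fine.
q = queue.Queue(maxsize=100)
inserted = []

def worker():
    while True:
        item = q.get()
        if item is None:        # sentinel: shut down
            break
        inserted.append(item)   # stands in for the MySQL insert
        q.task_done()

t = threading.Thread(target=worker)
t.start()
for company in range(1000):    # stands in for companies downloaded from DBA
    # put() blocks when the queue is full; with a timeout argument it
    # would raise queue.Full instead, which is what a timeout error in
    # the producer log would correspond to.
    q.put(company)
q.put(None)
t.join()
```

The key point is that a slow consumer (MySQL inserts) backpressures the producer through the bounded queue, so a dropping download counter can mean either a throttled connection or a slow database.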
I understand, thanks! The speed drops quite dramatically. It starts fast in the first minute or so, with more than 1000 companies added. Soon after, it drops to the hundreds, and within an hour or so it gets below a hundred.
Then my guess is that the problem is downloading large amounts of data from Elasticsearch over HTTPS, as the program does, for some reason. Best guess is security/firewall/...
Sorry for the late reply. I was away for business travel...
I tried increasing the computing power (number of CPUs and RAM), but to no avail -- the speed still drops quickly. I'm less worried about the speed than the number of missing rows, as I can tolerate the slow speed.
I compared a sample of firms with the web DBA (at https://datacvr.virk.dk/), and I can see my Updates table has many missing pieces of information compared to the web version. I wonder if the following errors (?) in the log files are causing the missing rows. What do you think about these messages in the consumer log files? There are many lines like:
(MySQLdb.IntegrityError) (1062, "Duplicate entry 'ST Holdings S.à r.l.' for key 'Navne.navn'")
[SQL: INSERT INTO `Navne` (navn) VALUES (%s)]
[parameters: (('PPD SCANDINAVIA AB',), ('Richard Andreas Budeng',), ('Leonardo André Cuscona',), ('Online Brands LLC',), ('Marc Dilly',), ('Marja Kristina Renvall',), ('Bjørn Bo Kristoffer Forsen',), ('Helder Boavida',) ... displaying 10 of 61 total bound parameter sets ... ('Christian Harald Bang-Melchior',), ('Antje Siem',))]
(Background on this error at: https://sqlalche.me/e/14/gkpj) - attempt 0
Thanks.
I also see these messages in the command line during update. I've tried to search a bit to understand what it means, but I couldn't figure it out. It does contain the message about "Lock wait timeout exceeded":
273555it [55:13, 14.49it/s]Exception in thread Thread-34087:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
self.dialect.do_execute(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
res = self._query(query)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
db.query(q)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
_mysql.connection.query(self, query)
MySQLdb.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.9/site-packages/cvrparser/elastic_cvr_extract.py", line 340, in worker
session.query(table_class).filter(table_class.enhedsnummer.in_(enh)).delete(synchronize_session=False)
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3191, in delete
result = self.session.execute(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1692, in execute
result = conn._execute_20(statement, params or {}, execution_options)
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1620, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
return connection._execute_clauseelement(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1487, in _execute_clauseelement
ret = self._execute_context(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1851, in _execute_context
self._handle_dbapi_exception(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2032, in _handle_dbapi_exception
util.raise_(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
raise exception
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1808, in _execute_context
self.dialect.do_execute(
File "/opt/conda/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 206, in execute
res = self._query(query)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/cursors.py", line 319, in _query
db.query(q)
File "/opt/conda/lib/python3.9/site-packages/MySQLdb/connections.py", line 254, in query
_mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')
[SQL: DELETE FROM `Updates` WHERE `Updates`.enhedsnummer IN (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: (4003754700, 4002079790, 4002548705, 4002079806, 4002547988, 4002629843, 4002548015, 4007286793, 4007388056, 4002549715, 4006250208, 4002549382, 4002079080, 4002549405, 4002629463, 4002549412, 4003717878, 4002549418, 4007383422, 4002080169, 4002555808, 4002080446, 4002555828, 4002555868, 4002074539, 4006373123, 4002556515, 4003587500, 4002556542, 4002556562, 4006445970, 4002556568, 4007030638, 4002556182, 4007451038, 4002556195, 4003583410, 4002556225, 4002313535, 4002520488, 4002366423, 4002520772, 4006844434, 4002366436, 4002520815, 4006357927, 4002520835, 4003586206, 4002521148, 4003570224, 4003586239, 4002521168, 4006725643, 4002079019, 4002521175, 4003832679, 4003582026, 4002521188, 4006961747, 4002079783, 4002521492, 4006626381, 4002521548, 4002629823)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
The duplicate key errors occur because the different workers use the database at the same time: one worker may insert unique data while another worker is handling a company with the same unique data. This creates some issues which I check for, leading to errors that are then fixed. This was the price I paid for easy parallelism to speed things up. If one reduces the worker count to 1, this should not happen. I think there is an option for that, but I have to check.
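The race described above can be illustrated with a toy example (SQLite is used here purely for illustration; the project itself talks to MySQL via SQLAlchemy, and the table/column names mirror the log excerpt):

```python
import sqlite3

# Two workers both try to insert the same unique name. The second
# insert hits a duplicate-key error, which is then treated as "the row
# already exists" rather than a fatal failure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Navne (navn TEXT UNIQUE)")

def insert_name(navn):
    try:
        conn.execute("INSERT INTO Navne (navn) VALUES (?)", (navn,))
        return "inserted"
    except sqlite3.IntegrityError:
        # Another worker won the race; recover by using the existing row.
        return "duplicate"

first = insert_name("ST Holdings S.à r.l.")
second = insert_name("ST Holdings S.à r.l.")
```

With a single worker (-n 1) the second, losing insert simply never happens, which is why the duplicate-key noise disappears from the logs.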
The slowness is not because of CPU; I think something is throttling your download speed, as all the progress counter shows is the number of downloaded companies. As I said, if the MySQL database has issues, then at some point the queue fills up and the download slows down. I can try to see if there is an easy way to detect this later this week.
Can you send some CVR numbers of companies missing rows? Then I can check if I have any trouble with them.
I added an option to set the number of workers: python -m cvrparser update -n 1 should do the trick for one worker. Then there should be no errors like this, I think. If there is a timeout in the producer due to the queue being full, the producer log should show a timeout error.
Thanks! I'm trying with python -m cvrparser update -n 1. I don't see similar errors in the log files anymore. I still have a problem with slowing speed; I'll let this download finish and compare the results.
For the list of CVRs with missing information, I'll collect them and share with you in the coming days.
I'm sharing some tables from the old and new versions that I have. They are not about CVR numbers, but other tables which grabbed my attention when I compared them at first; they have differences which were easy to notice.

The attached file has two versions of the tables. The first version, labeled 'old', is from my colleague, and the tables there contain more rows. The second version, labeled 'new', is from my recent download, and it has some missing rows. (The two versions have a few other differences, which are formatting issues, such as the delimiter character or wrappers for values.)
In particular, if you look at virksomhedsstatus.csv, the two versions (old vs. new) have two differences:
- old: 22;"AKTIV" vs. new: "36","AKTIV"
- 18;"Tvangsopløst" (present only in the old version)
If you look at virksomhedsform.csv, the difference is that the new version does not have the row 245;"REG";"Region";"DST".
I will add actual CVRs of firms later! Hope this helps in the meanwhile.
The code numbers, for instance 22 for "AKTIV", are unique ids automatically generated by the database. They have no meaning by themselves and are only relevant for joining with the Updates table to get the full information. When you download from scratch, new ids are issued depending on the order in which rows are inserted. In the meantime, someone at DBA may have cleaned the database and, for instance, removed the instance "Tvangsopløst" and turned it into "TVANGSOPLØST" or something like that, in which case that name would be gone. The virksomhedsstatus table just shows the unique strings seen in the data for that piece of information (I am pretty sure that differences in upper and lower case make the strings different in the system). The same goes for virksomhedsform. Another possibility is that an error was made; the best way to determine this is to look in the old database, find the company having 18;"Tvangsopløst" as a state, and then check that company to see if there is a bug.
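The point about auto-generated ids can be illustrated with a small sketch (SQLite and the table layout here are purely for illustration): the same status string receives a different id in two separate downloads, because the id depends only on insertion order.

```python
import sqlite3

# Surrogate ids are handed out by AUTOINCREMENT in insertion order, so
# two fresh downloads that insert statuses in a different order assign
# different ids to the same string. Only the strings, not the ids, are
# comparable across databases.
def build(order):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE status (id INTEGER PRIMARY KEY, navn TEXT UNIQUE)")
    for navn in order:
        conn.execute("INSERT INTO status (navn) VALUES (?)", (navn,))
    return dict(conn.execute("SELECT navn, id FROM status"))

old = build(["Tvangsopløst", "AKTIV"])   # AKTIV gets id 2 here
new = build(["AKTIV", "NORMAL"])         # same string, different id (1)
```

This is why comparing 22;"AKTIV" against "36","AKTIV" by id is meaningless: both rows denote the same status, just inserted at different positions.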
I just filled the table from scratch, and in my table the row '235', 'SOV', 'Selvstændig offentlig virksomhed', 'E&S' exists, so I do not know why it is not in yours. Here 235 is a code number from DBA. For virksomhedsstatus I get 22 rows:
'1','NORMAL'
'2','UNDER TVANGSOPLØSNING'
'3','TVANGSOPLØST'
'4','UNDER FRIVILLIG LIKVIDATION'
'5','OPLØST EFTER FRIVILLIG LIKVIDATION'
'6','UNDER KONKURS'
'7','OPLØST EFTER KONKURS'
'8','OPLØST EFTER SPALTNING'
'9','OPLØST EFTER ERKLÆRING'
'10','OPLØST EFTER FUSION'
'11','UDEN RETSVIRKNING'
'12','SLETTET'
'13','UNDER REASSUMERING'
'14','UNDER REKONSTRUKTION'
'15','slettet'
'16','Slettet'
'17','OPLØST EFTER GRÆNSEOVERSKRIDENDE HJEMSTEDSFLYTNING'
'18','OPLØST'
'19','Under reassumering'
'20','OPLØST EFTER GRÆNSEOVERSKRIDENDE FUSION'
'21','UNDER REASUMMERING'
'22','AKTIV'
Hi, sorry for the late response (I got a little sick...)
I'm sharing a list of CVRs with missing information. In other words, these rows are found only in the old Updates table, but not in the new one. To focus on differences, I look at the subset of the Updates table containing status updates. To be specific, I condition on felttype=="status" OR felttype=="virksomhedsstatus".
This list is not exhaustive -- I compared the old and new tables manually from the top of the table, which was sorted ascending by enhedsnummer. So there are many more rows which are found only in the old version, but not the new version.
I manually compared about 300 rows between the old and new tables. My sense is that there is no case where the new table has rows that are not found in the old table. In other words, the new table is strictly a subset of the old table.
Also, I noticed that when an 'enhedsnummer' is missing from the new table, its entire set of status updates is missing. For example, in my file, all 3 rows with enhedsnummer==9102 are missing from the new table, rather than only some of them.
I did some sanity checks myself by looking up these missing rows in the web CVR (https://datacvr.virk.dk/). I could not find any particular pattern that would explain the omission in the new dataset.
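A comparison like the one described can also be scripted; here is a minimal sketch, assuming two CSV exports of the Updates table that both have an enhedsnummer column (the file contents and delimiters below are made up for illustration, mirroring the formatting differences mentioned earlier):

```python
import csv
import io

# Collect the set of enhedsnummer values from a CSV export, then take a
# set difference to find identifiers that exist only in the old export.
def enhedsnumre(text, delim):
    reader = csv.DictReader(io.StringIO(text), delimiter=delim)
    return {row["enhedsnummer"] for row in reader}

# Inlined stand-ins for the real files; the old export uses ';' and the
# new one ',' as delimiter, as noted in the thread.
old_csv = "enhedsnummer;felttype\n9102;status\n9102;status\n9103;status\n"
new_csv = "enhedsnummer,felttype\n9103,status\n"

only_in_old = enhedsnumre(old_csv, ";") - enhedsnumre(new_csv, ",")
```

Working at the level of enhedsnummer sets sidesteps the id-renumbering issue, since enhedsnummer comes from the source data rather than from auto-generated database keys.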
I updated the list to be exhaustive -- the shared file now contains all firms that are missing in the new version. (Please note that the old version was downloaded around 2021-12-12; that's why the maximum value of sidstopdateret is that date.)
only_in_old.zip
Hi again. Apologies for the long wait; I have been very busy and completely forgot. I checked a few random entries in your list of missing rows, and they are not missing from my table, which I recently updated.
SELECT * FROM more_bleeding.virk_virksomhedsstatus where enhedsnummer in (6, 947392, 672078)
6 57981717 1 NORMAL 1976-07-01 00:00:00 2200-01-01 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1975-12-30 00:00:00 1991-04-17 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1992-06-23 00:00:00 2200-01-01 00:00:00 2015-03-03 01:00:00
672078 36940816 2 UNDER TVANGSOPLØSNING 1991-04-18 00:00:00 1992-06-22 00:00:00 2015-03-03 01:00:00
947392 13411581 1 NORMAL 1989-08-01 00:00:00 2018-08-15 00:00:00 2018-09-27 16:13:26
947392 13411581 10 OPLØST EFTER FUSION 2018-08-16 00:00:00 2018-08-16 00:00:00 2018-09-27 16:13:28
which matches your csv file.
I will try completely emptying my table and retrying.
The new test is done, and the result is the same.
6 57981717 1 NORMAL 1976-07-01 00:00:00 2200-01-01 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1975-12-30 00:00:00 1991-04-17 00:00:00 2015-02-10 01:00:00
672078 36940816 1 NORMAL 1992-06-23 00:00:00 2200-01-01 00:00:00 2015-03-03 01:00:00
672078 36940816 3 UNDER TVANGSOPLØSNING 1991-04-18 00:00:00 1992-06-22 00:00:00 2015-03-03 01:00:00
947392 13411581 1 NORMAL 1989-08-01 00:00:00 2018-08-15 00:00:00 2018-09-27 16:13:26
947392 13411581 9 OPLØST EFTER FUSION 2018-08-16 00:00:00 2018-08-16 00:00:00 2018-09-27 16:13:28
So it is not clear where the error is yet.
It seems the package install manager installs the wrong elasticsearch client: install a version 6.* instead of 7. I will try to update the install script at some point to make it work as intended.
I see. Hope this solves the issue :) Thanks a lot!
Hi Allan, could you kindly help me with the issue? I've tried installing the elasticsearch package separately before installing cvrparser, but it didn't work out.
I would guess you need to install it after, so that the installation does not overwrite it. If I find some time (which is not guaranteed) I will try to update the install script, which seems to pick the wrong version. It seems to be specified correctly in the file I use, but clearly it is not, since it does not work.
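Until the install script is fixed, a manual workaround along these lines should work: reinstall the Elasticsearch client after installing cvrparser, pinning it to the 6.x series mentioned above (the exact version constraint here is an assumption, not taken from the project's requirements file):

```shell
# Install cvrparser first, then force the client back to 6.x so that
# pip's dependency resolution does not leave a 7.x release in place.
pip install git+https://github.com/gronlund/cvrdata
pip install "elasticsearch>=6,<7"
```

Running `pip show elasticsearch` afterwards confirms which version actually ended up installed.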
Hi Allan,
First of all, thanks for publishing the package! I have two issues while using it.

1. python -m cvrparser update took about 100 hours to finish. The downloading speed is pretty low, but I'm not sure why that is the case; when I run python -m cvrparser get_regs, it's much faster. To be concrete, 15 it/s versus 2000 it/s.
2. The update command returns tables that have some missing rows. I compared with tables my colleague downloaded a few months ago. I suspect it's because of the error messages I get in the log files. I have attached all the log files for your reference (one consumer log file is particularly long, so I had to compress it).

My computing environment is: Ubuntu 22.04, Python 3.9.10, MySQL 8.0.30, CPU with 8 cores and 47 GB memory.
Thank you!
logs.zip