ryanwhalen / patent_similarity_data

US utility patent similarity data creation and analysis tools
MIT License

It is very slow to load the dataset into pandas #2

Open slucyp opened 2 years ago

slucyp commented 2 years ago

Hi Ryan,

Thanks very much for posting the code and the patent similarity data. I am a Ph.D. student at the Business School at the University of Pittsburgh, and I am very interested in using the text-derived patent similarity data in my dissertation. I followed your code, downloaded the data, and wrote it into a local SQL database. But when I try to run the code in the patent_similarity_data Jupyter Notebook, it takes forever to load the dataset, even when I query a single data point, e.g. v1 = cur.execute('''SELECT vector FROM doc2vec WHERE patent_id = '9000000' ''').fetchone(). Is this a common issue? My computer has 16GB of RAM, which I would think should be plenty for querying a single row. I am new to working with datasets this large, so any suggestions and advice are deeply appreciated. Thanks very much in advance!

Best, Lucy

ryanwhalen commented 2 years ago

The first thing I'd do is check that the DB is appropriately indexed. Try using an SQLite browser (e.g. https://sqlitestudio.pl) to run the same query. If it is slow there as well, try building an index on the columns you plan to select on.
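The indexing advice above can be sketched with Python's built-in sqlite3 module. This uses an in-memory database with a tiny doc2vec table as a stand-in for the real one (the table and column names come from the query in the issue; the vector contents and index name are made up for illustration), and uses EXPLAIN QUERY PLAN to confirm the lookup uses the index:

```python
import sqlite3

# Stand-in for the real database: an in-memory DB with a tiny
# doc2vec table. Table/column names follow the query in the issue;
# the vector values here are made up for illustration.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE doc2vec (patent_id TEXT, vector TEXT)')
cur.execute("INSERT INTO doc2vec VALUES ('9000000', '0.1,0.2,0.3')")

# Without an index, a WHERE patent_id = ... lookup scans every row.
# Indexing the lookup column turns that into a fast B-tree search.
# On the full dataset this can take a while to build, but it only
# needs to be done once.
cur.execute('CREATE INDEX IF NOT EXISTS doc2vec_patent_id_idx '
            'ON doc2vec (patent_id)')

# EXPLAIN QUERY PLAN shows whether the index is actually used.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT vector FROM doc2vec "
    "WHERE patent_id = '9000000'"
).fetchall()

# The single-row lookup from the issue should now return quickly.
v1 = cur.execute(
    "SELECT vector FROM doc2vec WHERE patent_id = '9000000'"
).fetchone()
conn.close()
```

On the real database the same CREATE INDEX statement can be run once from SQLiteStudio or a script; the plan output should report a SEARCH using the index rather than a full-table SCAN.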

-R



slucyp commented 2 years ago

Thanks very much! I followed your advice and indexed each table in the DB on patent_id or id. The query is now very quick in SQLiteStudio, but it is still very slow when I use pd.read_sql_query() in the Jupyter notebook. Do you have any further suggestions for this issue? Thanks again!

ryanwhalen commented 2 years ago

In that case it is likely a pandas issue. You can try using the Python sqlite3 library to run the DB query directly and then convert the results into a dataframe another way.
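The workaround above might look like the following sketch: query with a plain sqlite3 cursor and build the DataFrame from the fetched rows, bypassing pd.read_sql_query(). An in-memory database stands in for the real one, and the row values are made up for illustration:

```python
import sqlite3
import pandas as pd

# Stand-in for the real database (table/column names follow the
# issue's query; the values are made up for illustration).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE doc2vec (patent_id TEXT, vector TEXT)')
conn.executemany('INSERT INTO doc2vec VALUES (?, ?)',
                 [('9000000', '0.1,0.2'), ('9000001', '0.3,0.4')])

# Run the query with the sqlite3 cursor directly, then construct
# the DataFrame from the fetched rows -- no pd.read_sql_query().
cur = conn.execute('SELECT patent_id, vector FROM doc2vec')
columns = [desc[0] for desc in cur.description]  # column names
df = pd.DataFrame(cur.fetchall(), columns=columns)
conn.close()
```

cur.description supplies the column names, so the resulting DataFrame has the same shape pd.read_sql_query() would produce, while the fetch itself goes through sqlite3 alone.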

-R

