google / patents-public-data

Patent analysis using the Google Patents Public Datasets on BigQuery
https://bigquery.cloud.google.com/dataset/patents-public-data:patents
Apache License 2.0

BERT for Patents yields 1024 element array, but embedding_v1 is 64 element #49

Open sthorpe11 opened 3 years ago

sthorpe11 commented 3 years ago

How should I generate an embedding equivalent to embedding_v1? BERT for Patents generates a 1024-element embedding, but embedding_v1 has only 64 elements.

wetherbeei commented 3 years ago

The model to generate embedding_v1 has not been released, and we also haven't released pre-embedded patents with the BERT model in BigQuery.

You could experiment with learning a mapping from BERT to embedding_v1 with a linear layer - they should match up well because they're both based on text. embedding_v1 is a set-of-words unigram model.
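For anyone who wants to try the linear-mapping idea, here is a minimal sketch using ordinary least squares in NumPy. It is not the method used to produce embedding_v1. The function names `fit_linear_map` and `apply_linear_map` are hypothetical, and it assumes you have already computed 1024-element BERT vectors yourself and paired them row by row with the 64-element embedding_v1 vectors for the same publications.

```python
import numpy as np

def fit_linear_map(bert_vecs, v1_vecs):
    """Least-squares fit of W and b so that bert_vecs @ W + b approximates v1_vecs.

    bert_vecs: (n, 1024) array of BERT embeddings (assumed precomputed).
    v1_vecs:   (n, 64) array of embedding_v1 values for the same publications.
    """
    n = bert_vecs.shape[0]
    # Append a column of ones so the bias is learned together with the weights.
    X = np.hstack([bert_vecs, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(X, v1_vecs, rcond=None)
    W, b = coef[:-1], coef[-1]
    return W, b

def apply_linear_map(bert_vecs, W, b):
    """Project BERT embeddings into the (approximate) embedding_v1 space."""
    return bert_vecs @ W + b
```

You need more paired examples than input dimensions (well over 1024 rows) for the fit to be meaningful, and it is worth checking the error on held-out publications before relying on the projected vectors.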

sanealytics commented 2 years ago

Can you give some insight into how you dealt with BERT's limited window size? E.g., did you choose between abstract/patent/etc., pool things, or something else?

sthorpe11 commented 2 years ago

Hi Saurabh,

We limited the window to claim 1.

Scott
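A rough sketch of what "limit the window to claim 1" could look like, not necessarily how it was done here. The helper `first_claim` is hypothetical, and it assumes the claims text numbers each claim as "1.", "2.", ... at the start of a line, which will not hold for every publication.

```python
import re

def first_claim(claims_text: str) -> str:
    """Return the text of claim 1, assuming claims are numbered '1.', '2.', ..."""
    # Claim 1 ends where a line beginning with "2." starts.
    match = re.search(r"(?:^|\n)\s*2\s*\.\s", claims_text)
    body = claims_text[:match.start()] if match else claims_text
    # Drop the leading "1." marker if present.
    return re.sub(r"^\s*1\s*\.\s*", "", body).strip()
```

Claim 1 can still be longer than the model's maximum sequence length, so the tokenizer may need to truncate it anyway.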



sanealytics commented 2 years ago

Thanks for that quick response. This repo is a great resource.

KNT-AJ commented 5 months ago

This repo is great. Thank you! Any plans to release the model that generated embedding_v1 or the BERT pre-embedded patents?