seancarmody / ngramr

R package to query the Google Ngram Viewer

Download Parsing #45

Open econinomista opened 11 months ago

econinomista commented 11 months ago

Thank you so much for providing this package! I am looking for a way to download the frequencies of many words at the same time and to get information about their position in the sentences where they are used. For instance, I would like to compare the frequency of "apple" being used as a subject versus as an object. Is there a way to adjust your code to do this and, beyond just plotting the frequencies, to carry out further calculations?

All the best and thank you in advance!

seancarmody commented 11 months ago

Thanks for your interest in the package. It works by scraping the Google Ngram Viewer page, so it can only provide the data you can obtain from that page. As far as I know there is no way to query an arbitrary position in a sentence, but if you have a look at Google's information page you can ask for words appearing at the beginning (or end) of a sentence (see the link below). So you could run a function call like ngram(c("apple", "appleSTART")). Note that this would give you data on all occurrences of "apple" as well as on those occurrences where "apple" starts the sentence; the first data set includes the second, so you may want to subtract the sentence-start occurrences from the total. I hope that helps.

https://books.google.com/ngrams/info
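As a rough sketch of that subtraction (assuming the sentence-start phrase syntax from the example above, and that ngram() returns its usual Year/Phrase/Frequency columns):

library(ngramr)

# Total occurrences of "apple" versus occurrences at the start of a sentence;
# the "appleSTART" form follows the example above -- check the Google info
# page for the exact syntax
apple <- ngram(c("apple", "appleSTART"), year_start = 1950)

# Put the two series side by side and subtract the sentence-start frequency
totals <- subset(apple, Phrase == "apple")
starts <- subset(apple, Phrase == "appleSTART")
both <- merge(totals, starts, by = "Year", suffixes = c("_total", "_start"))
both$Frequency_rest <- both$Frequency_total - both$Frequency_start
head(both[, c("Year", "Frequency_total", "Frequency_start", "Frequency_rest")])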

Sean.

econinomista commented 11 months ago

Dear Sean, thank you so much for the quick response. I think I see the problem. However, it should be possible to download, for example, all the data about "apple" as part of a trigram or larger context and then use a package like spacy to identify the semantic labels, right? Can I also use the R package to download n-grams larger than one? All the best, Nikola


seancarmody commented 11 months ago

Unfortunately what you've described is not feasible. Google does provide the raw n-gram data here: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html. However, as you can see, this involves a large number of very large files, and it would not be practical for the package to download and process all of that data when making function calls. Even downloading just the 1-gram files would be impractical, and to cover "apple" in any position the package would also have to download every 2-gram file (http://storage.googleapis.com/books/ngrams/books/20200217/eng/eng-2-ngrams_exports.html), every 3-gram file, and so on, since "apple" could appear in any slot.

If there were a site with a database that could be queried efficiently, the package could make calls to that, but I don't know of any such database with direct query access. Google clearly has a database powering the Ngram Viewer charts but does not provide direct access to it.

That's why the package works the way it does: it makes calls to the chart viewer and then scrapes the chart data into an R data table. So the package can only return the sort of data the chart viewer can, and it does not support queries over the raw n-gram files: if you ask for "red apple" it will return the data for that specific 2-gram, but it will not return a count of all 2-grams that include "apple".
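As a small illustration of what the package can return, a specific multi-word n-gram can be queried directly:

library(ngramr)

# Returns data for the exact 2-gram "red apple" only, not for every 2-gram
# containing "apple"
red_apple <- ngram("red apple", year_start = 1900)
head(red_apple)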


-- Sean Carmody

econinomista commented 11 months ago

Dear Sean,

thank you very much for your reply. I think I will concentrate on 1-grams then. Since you say ngramr can access the data that the Ngram Viewer uses: is there a way to modify the code so that you can also get, for instance, French n-grams or other languages?

All the best and thank you again Nikola


seancarmody commented 11 months ago

No need for modification: the package already lets you download n-grams from other languages using the 'corpus' argument. For example, for French:

ngram("chat", corpus = "fr-2019")

The documentation provides a list of valid corpora.
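As a quick sketch of comparing corpora (the French corpus name is taken from the example above; "en-2019" for English is an assumption here, so check the documented corpus list):

library(ngramr)

# French corpus, as above, alongside an English query for comparison
chat_fr <- ngram("chat", corpus = "fr-2019", year_start = 1900)
cat_en <- ngram("cat", corpus = "en-2019", year_start = 1900)
head(chat_fr)
head(cat_en)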

Sean.
