Closed — kmunger closed this issue 6 years ago
Not the same, just very close. It turns out that the minimum Google frequency is not a very strong determinant of the probabilities, and that the minima are not very different.
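A toy illustration of why (my own sketch, not the package's internals): suppose a 100-token document in which only 2 tokens fall below the Google-frequency floor. Moving that floor by a full order of magnitude shifts the document's mean log-frequency covariate by only 2/100 * log(10), roughly 0.046 on the log scale — far too small to move the fitted probabilities much.

```r
set.seed(1)
logfreq <- log(runif(98, 1e-6, 1e-2))  # made-up in-vocabulary log frequencies
floor_a <- log(1e-9)                   # candidate minimum A
floor_b <- log(1e-10)                  # candidate minimum B, 10x smaller
# mean log-frequency under each floor; only the 2 floored tokens differ
shift <- mean(c(logfreq, rep(floor_a, 2))) - mean(c(logfreq, rep(floor_b, 2)))
shift  # = 2/100 * log(10), about 0.046
```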
library("quanteda")
library("sophistication")
load("analysis_article/AJPS_replication/data/fitted_BT_model.Rdata")
data(data_corpus_sotu, package = "quanteda.corpora")
# subset to make the tagging faster in this example
data_corpus_sotu <- corpus_subset(data_corpus_sotu, year < 1800)
predict_readability(BT_best, newdata = data_corpus_sotu,
                    verbose = TRUE, baseline_year = 1800)
# Starting predict_readability (sophistication v0.65)...
# ...using BT_best as fitted BT model; data_corpus_sotu as newdata
# ...tagging parts of speech
# ...computing word lengths in characters
# ...computing baselines from Google frequencies
# ...aggregating to sentence level
# ...computing predicted values
# ...finished; elapsed time: 4.51 seconds.
# lambda prob scaled
# Washington-1790 -3.775170 0.1681448 5.345790
# Washington-1790b -4.368282 0.1004762 -29.767683
# Washington-1791 -4.079174 0.1297877 -12.651905
# Washington-1792 -3.800474 0.1646351 3.847735
# Washington-1793 -3.718028 0.1762895 8.728679
# Washington-1794 -3.885695 0.1532469 -1.197559
# Washington-1795 -3.987394 0.1405104 -7.218330
# Washington-1796 -3.819777 0.1619975 2.704939
# Adams-1797 -3.794218 0.1654973 4.218099
# Adams-1798 -4.126730 0.1245105 -15.467289
# Adams-1799 -4.196062 0.1171474 -19.571915
predict_readability(BT_best, newdata = data_corpus_sotu,
                    verbose = TRUE, baseline_year = 2000)
# Starting predict_readability (sophistication v0.65)...
# ...using BT_best as fitted BT model; data_corpus_sotu as newdata
# ...tagging parts of speech
# ...computing word lengths in characters
# ...computing baselines from Google frequencies
# ...aggregating to sentence level
# ...computing predicted values
# ...finished; elapsed time: 4.92 seconds.
# lambda prob scaled
# Washington-1790 -3.775167 0.1681452 5.345948
# Washington-1790b -4.368279 0.1004764 -29.767524
# Washington-1791 -4.079171 0.1297880 -12.651747
# Washington-1792 -3.800471 0.1646355 3.847894
# Washington-1793 -3.718026 0.1762899 8.728837
# Washington-1794 -3.885693 0.1532473 -1.197400
# Washington-1795 -3.987391 0.1405107 -7.218171
# Washington-1796 -3.819774 0.1619979 2.705098
# Adams-1797 -3.794215 0.1654977 4.218258
# Adams-1798 -4.126727 0.1245108 -15.467131
# Adams-1799 -4.196059 0.1171477 -19.571756
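The near-identical output can be checked directly rather than by eyeballing the printout. A quick sketch, assuming `predict_readability()` returns its results as a data frame with a `lambda` column (as the printed output above suggests):

```r
# compare the two baseline years numerically
pred_1800 <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                 baseline_year = 1800)
pred_2000 <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                 baseline_year = 2000)
# maximum absolute difference in lambda; essentially zero if baseline_year
# is not being used
max(abs(pred_1800$lambda - pred_2000$lambda))
```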
Update: In 465247d I fixed a bug, and the new results look like this:
library("quanteda")
library("sophistication")
load("analysis_article/AJPS_replication/data/fitted_BT_model.Rdata")
data(data_corpus_sotu, package = "quanteda.corpora")
# subset to make the tagging faster in this example
data_corpus_sotu <- corpus_subset(data_corpus_sotu, Date < "1800-01-01")
predict_readability(BT_best, newdata = data_corpus_sotu,
                    verbose = TRUE, baseline_year = 1800)
# Starting predict_readability (sophistication v0.65)...
# ...using BT_best as fitted BT model; data_corpus_sotu as newdata
# ...tagging parts of speech
# ...computing word lengths in characters
# ...computing baselines from Google frequencies
# ...aggregating to sentence level
# ...computing predicted values
# ...finished; elapsed time: 5.12 seconds.
# lambda prob scaled
# Washington-1790 -3.775170 0.1681448 5.345790
# Washington-1790b -4.368282 0.1004762 -29.767683
# Washington-1791 -4.079174 0.1297877 -12.651905
# Washington-1792 -3.800474 0.1646351 3.847735
# Washington-1793 -3.718028 0.1762895 8.728679
# Washington-1794 -3.885695 0.1532469 -1.197559
# Washington-1795 -3.987394 0.1405104 -7.218330
# Washington-1796 -3.819777 0.1619975 2.704939
# Adams-1797 -3.794218 0.1654973 4.218099
# Adams-1798 -4.126730 0.1245105 -15.467289
# Adams-1799 -4.196062 0.1171474 -19.571915
predict_readability(BT_best, newdata = data_corpus_sotu,
                    verbose = TRUE, baseline_year = 2000)
# Starting predict_readability (sophistication v0.65)...
# ...using BT_best as fitted BT model; data_corpus_sotu as newdata
# ...tagging parts of speech
# ...computing word lengths in characters
# ...computing baselines from Google frequencies
# ...aggregating to sentence level
# ...computing predicted values
# ...finished; elapsed time: 5.63 seconds.
# lambda prob scaled
# Washington-1790 -3.775157 0.1681466 5.346556
# Washington-1790b -4.367517 0.1005453 -29.722386
# Washington-1791 -4.079152 0.1297902 -12.650570
# Washington-1792 -3.800471 0.1646355 3.847894
# Washington-1793 -3.718006 0.1762928 8.730014
# Washington-1794 -3.885683 0.1532486 -1.196825
# Washington-1795 -3.987376 0.1405126 -7.217279
# Washington-1796 -3.819767 0.1619988 2.705528
# Adams-1797 -3.794069 0.1655179 4.226937
# Adams-1798 -4.126648 0.1245194 -15.462451
# Adams-1799 -4.195844 0.1171700 -19.558973
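After the fix, the same comparison (again assuming a data-frame return value with a `lambda` column) should now show a small but nonzero difference between the two baseline years:

```r
pred_1800 <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                 baseline_year = 1800)
pred_2000 <- predict_readability(BT_best, newdata = data_corpus_sotu,
                                 baseline_year = 2000)
# distribution of the per-document differences; small but nonzero after 465247d
summary(abs(pred_1800$lambda - pred_2000$lambda))
```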
The results are the same regardless of how the year argument is specified.
Example: