kbenoit / sophistication

R package associated with Benoit, Munger and Spirling (2017) paper(s)

unclear reference to lambda baseline #2

Closed: kbenoit closed this issue 6 years ago

kbenoit commented 6 years ago

In https://github.com/kbenoit/sophistication/blob/master/R/predict.R#L138, we refer to `reference`, but this is a holdover from older code, written before we changed the arguments to `reference_top` and `reference_bottom`.

@ArthurSpirling @kmunger can you recall which one this is supposed to be?
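
For readers without the file open, here is a hypothetical sketch of the kind of mismatch being described -- the function and variable names below are illustrative only, not the actual predict.R code:

```r
# Illustrative only: the signature was renamed, but the body still
# refers to the old argument name
predict_lambda <- function(object, newdata,
                           reference_top = NULL, reference_bottom = NULL) {
  # ... fitted lambdas computed here ...
  # leftover from the old API: 'reference' is no longer an argument
  prob <- stats::plogis(lambda - reference)  # error: object 'reference' not found
  # ...
}
```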

ArthurSpirling commented 6 years ago

Let me take a look tomorrow.

ArthurSpirling commented 6 years ago

Actually, wasn't this user-defined? That is, it was up to the user to decide what they wanted to compare a given lambda to -- for example, one could specify an interest in comparing a particular snippet to, e.g., one by Eisenhower (assuming that was already in the data) as a reference.

If I have that wrong, then I'm pretty certain it was intended to default to the fifth-grade text, which is the hardcoded -2.17... figure.

Can you clarify what `reference_bottom` is? (I assume it's the hardest snippet in the data, or something.)

kmunger commented 6 years ago

That's right -- the `reference` call is from older code. The current code has hardcoded top and bottom values that were derived by simply sorting the lambdas on the SOTU corpus and taking the extremes. When I did this, I left the older code in there and just added an extra column with the new, hardcoded approach.

The immediate solution is just to get rid of the old code, which I can do easily. But the longer-term question is whether we should make this user-definable: should we use the SOTU values as defaults and allow users to change them if they want?
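
A minimal sketch of how those defaults could be derived and then exposed as user-overridable arguments (`sotu_lambdas` is a hypothetical vector of fitted values, not an object in the package):

```r
# Take the extremes of the fitted SOTU lambdas as default endpoints
endpoints <- range(sotu_lambdas)                # hypothetical fitted values
reference_bottom_default <- endpoints[1]        # hardest SOTU snippet
reference_top_default    <- endpoints[2]        # easiest SOTU snippet

# Users could then override the defaults in the call, e.g.:
# predict(fit, newdata, reference_top = ..., reference_bottom = ...)
```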

ArthurSpirling commented 6 years ago

Yes, I think that's what we want: default to the present values, but allow users to specify something other than the defaults should they want to.

kmunger commented 6 years ago

OK, I'm on that.

And I've realized what the problem is: we've confused the `reference` (used to compute the probability scores) with the endpoints used for rescaling. These don't necessarily have to come from the same source, but all three do need to be supplied as defaults (or user-provided).

I'm currently rewriting the documentation to reflect what we're doing:

The default value for `reference` is the lambda across the fifth-grade texts -- our `prob` output thus calculates the probability that a text is easier than these.

The default values for `reference_top` and `reference_bottom` come from the extremes of the SOTU corpus, and are used to rescale texts onto the 0-100 scale.

Are these the defaults we want?
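
In code, the distinction between the two roles might look like this -- a sketch only: the logistic link is an assumption based on the paper's Bradley-Terry setup, and every numeric default except the fifth-grade -2.17 baseline is a placeholder:

```r
# Role 1: probability that a text is easier than the reference
# (Bradley-Terry comparison on the lambda scale)
prob_easier <- function(lambda, reference = -2.17) {
  stats::plogis(lambda - reference)
}

# Role 2: rescaling onto 0-100, with the hardest reference mapped
# to 0 and the easiest to 100
rescale_0_100 <- function(lambda, reference_top, reference_bottom) {
  100 * (lambda - reference_bottom) / (reference_top - reference_bottom)
}
```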

ArthurSpirling commented 6 years ago

Don't we use the fifth-grade texts as 100? That's what the paper implies, no?

kmunger commented 6 years ago

Reading back over the documentation, yes, that seems to be the case -- and I just checked the numbers, which do match up.

So, are these the defaults we want: the baseline for the probability comparison being the same text that anchors 100 on the scaled version?

ArthurSpirling commented 6 years ago

That's what makes sense to me, yes: 100 is the fifth-grade text, 0 is the hardest SOTU text (which is at college level, by FRE standards). Those being the default endpoints for the 0-100 scale makes sense, with the fifth-grade texts as the default comparison for the probability calculations.
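
As a quick arithmetic check of those anchors (only the -2.17 fifth-grade figure comes from this thread; the bottom endpoint below is a made-up placeholder):

```r
reference_top    <- -2.17    # fifth-grade baseline, mapped to 100
reference_bottom <- -6.00    # placeholder hardest-SOTU lambda, mapped to 0
lambda <- (reference_top + reference_bottom) / 2   # a text exactly halfway

100 * (lambda - reference_bottom) / (reference_top - reference_bottom)
#> [1] 50
```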

kmunger commented 6 years ago

Ok, made these changes.

kbenoit commented 6 years ago

Thanks, I think that corrected it. @kmunger, with e7ac504 the package now passes the CRAN check -- except for the too-large data objects.

Note that I removed the `article_manuscript` and `manuscript_chapter` folders, since these should only be in the sophistication-papers repository.

ArthurSpirling commented 6 years ago

Very good -- so this will now appear on CRAN as a package?

kbenoit commented 6 years ago

No, we would need to submit it, and first cut out the large data objects: there is a 5 MB size limit on CRAN packages, and at 26.1 MB we are way over it. Most of those objects were for replicating our analysis, however, and they could be removed from the package.
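
One way to identify the offending objects is a quick check from the package root -- a sketch using `tools::checkRdaFiles()`, which reports the on-disk size and compression of each .rda file in a directory:

```r
# List every .rda file shipped in data/, largest first
info <- tools::checkRdaFiles("data")
info[order(info$size, decreasing = TRUE), c("size", "compress")]
```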

There are also some documentation and robustness (testing!) issues that need to be addressed before it's released as a general tool. I've spoken to @kmunger about this and am happy to guide work in this area.

ArthurSpirling commented 6 years ago

Thanks for the clarification - that makes sense.

kmunger commented 6 years ago

Indeed, I'm happy to start working on this; @ken, any guidance would be appreciated.

I'll go ahead and start removing the large data objects to get the package down to size.

kbenoit commented 6 years ago

The best approach would be to create the replication materials needed for our chapter and paper, removing the larger objects from the package as needed but using the package functions to get the results. Each time you make a data object local to the replication materials, you can remove it from the package.
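
A sketch of that per-object workflow (the object name and paths are hypothetical):

```r
# From a checkout of the replication repository, make the object local ...
load("../sophistication/data/data_corpus_large.rda")          # hypothetical name
save(data_corpus_large, file = "data/data_corpus_large.rda")  # local copy

# ... verify the replication scripts still run using the package functions,
# then drop the object from the package so it no longer counts
# against the CRAN size limit
file.remove("../sophistication/data/data_corpus_large.rda")
```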