neuroscout / neuroscout-paper

Neuroscout paper analysis repository
https://neuroscout.github.io/neuroscout-paper/

inspect LM surprisal #40

Closed · rbroc closed this issue 2 years ago

rbroc commented 2 years ago

Recap of what's been done so far: I extracted surprisal, entropy, and loss for GPT with different window sizes, using full transcripts vs. force-aligned transcripts for Narratives and Sherlock. (BERT cannot be used for forward language modeling as-is, as it always expects a [SEP] token.) A fun fact about these metrics: entropy is higher for the force-aligned transcripts (where there's no punctuation), while surprisal is lower for force-aligned (probably the lack of punctuation decreases the model's confidence in its top-predicted words). Loss is, as expected, lower when there is punctuation.
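For reference, a minimal sketch of how per-token surprisal and entropy can be computed with a GPT-style model over a sliding context window; the model name, window size, and function names below are illustrative and not the exact extraction settings used here:

```python
# Sketch: per-token surprisal and entropy (in nats) from GPT-2,
# conditioning on a sliding window of preceding tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def surprisal_entropy(text, window=64):
    """Return (tokens, surprisals, entropies), one value per token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    surprisals, entropies = [], []
    for i in range(1, len(ids)):
        # condition on at most `window` preceding tokens
        context = ids[max(0, i - window):i].unsqueeze(0)
        with torch.no_grad():
            logits = model(context).logits[0, -1]  # logits for the next token
        logprobs = torch.log_softmax(logits, dim=-1)
        surprisals.append(-logprobs[ids[i]].item())                  # -log p(w_i | context)
        entropies.append(-(logprobs.exp() * logprobs).sum().item())  # entropy of the predictive distribution
    return tokenizer.convert_ids_to_tokens(ids.tolist())[1:], surprisals, entropies
```

(The loss mentioned above is essentially the mean of these per-token surprisals.)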

I've looked into the extent to which entropy and surprisal correlate between:

  1. Full transcripts and force-aligned transcripts (the latter with no punctuation or capitalization, the former with tokens present)
  2. Force-aligned transcripts with capitalization and no tokens, and lowercased transcripts with tokens

In short, in the first case they correlate at ~.65-.75, in the second at ~.85-.90. Both are acceptable correlation levels for our goals, so I'd say we shouldn't worry too much about working with transcripts. An interesting fact, though, is that punctuation alone makes a ~.20 difference. I've only looked at this for one of the narratives, but I will do this more systematically for all of them asap.
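As a rough sketch of how such a comparison can be run (assuming per-token surprisals have already been extracted as above, glossing over how punctuation tokens are handled when aligning the two variants, and with all names illustrative):

```python
# Sketch: merge GPT-2 BPE-piece surprisals back to word level, then
# correlate the word-level series from two transcript variants.
# Assumes both variants contain the same words in the same order.
import numpy as np
from scipy.stats import pearsonr


def merge_to_words(tokens, surprisals):
    """Sum sub-word surprisals into word-level surprisals.

    GPT-2 marks word-initial pieces with a leading 'Ġ' (encoded space),
    so a new word starts whenever that marker appears.
    """
    words, word_surprisal = [], []
    for tok, s in zip(tokens, surprisals):
        if tok.startswith("Ġ") or not words:
            words.append(tok.lstrip("Ġ"))
            word_surprisal.append(s)
        else:
            words[-1] += tok
            word_surprisal[-1] += s
    return words, np.array(word_surprisal)


# hypothetical usage, with the two transcript variants as plain strings:
# toks_f, surp_f, _ = surprisal_entropy(full_transcript)
# toks_a, surp_a, _ = surprisal_entropy(aligned_transcript)
# _, w_full = merge_to_words(toks_f, surp_f)
# _, w_aligned = merge_to_words(toks_a, surp_a)
# r, p = pearsonr(w_full, w_aligned)
```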

Next steps:

rbroc commented 2 years ago

(note that this is not necessarily relevant for the paper but keeping it here just in case)

satra commented 2 years ago

@rbroc - that looks great. thanks for the summary.

rbroc commented 2 years ago

Closing this. For the LM part we've done some more comprehensive analyses and have a preprint coming soon. We also now have GPT entropy and surprisal extractors and have played a bit with them - I will push them to a separate repo if we decide to keep working on this.