This series of changes uses the document id to avoid creating edges between sentences in different documents. To do that, the document id has to be maintained throughout the featurization code, so it's carried along in the SentenceTokens and SentenceFeatures case classes. The document id is then used to pair up sentences from the same document, so that their similarity can be computed and compared against the threshold for creating graph edges (as before).
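
To make the pairing step concrete, here is a minimal sketch in plain Scala collections. It is not the code from these changes: the field names on SentenceTokens and SentenceFeatures, the Edge class, the term-count featurizer, the cosine measure, and the `>=` comparison are all assumptions chosen for illustration. Only the overall idea (carry the document id through featurization, then pair sentences within a document and apply the threshold) comes from the description above.

```scala
// Hypothetical shapes: the description names these case classes but not their fields.
final case class SentenceTokens(docId: String, sentenceId: Long, tokens: Seq[String])
final case class SentenceFeatures(docId: String, sentenceId: Long, features: Map[String, Double])
// Hypothetical edge type for the sentence graph.
final case class Edge(src: Long, dst: Long, weight: Double)

object SameDocumentEdges {
  // Stand-in featurizer (raw term counts); the point is that docId is passed through unchanged.
  def featurize(s: SentenceTokens): SentenceFeatures = {
    val counts = s.tokens.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    SentenceFeatures(s.docId, s.sentenceId, counts)
  }

  // Cosine similarity over sparse vectors, used here only as an example measure.
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum
    val norm = math.sqrt(a.values.map(v => v * v).sum) * math.sqrt(b.values.map(v => v * v).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Group by document id, form each unordered pair within a document once,
  // and keep the pairs whose similarity reaches the threshold (assumed inclusive).
  def edges(sentences: Seq[SentenceFeatures], threshold: Double): Seq[Edge] =
    for {
      (_, doc) <- sentences.groupBy(_.docId).toSeq
      (a, i)   <- doc.zipWithIndex
      b        <- doc.drop(i + 1)
      sim = cosine(a.features, b.features)
      if sim >= threshold
    } yield Edge(a.sentenceId, b.sentenceId, sim)
}
```

Grouping by document id before pairing means cross-document pairs are never generated at all, rather than being generated and filtered out afterwards, which also keeps the number of similarity computations down.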