paracrawl / Domain_Adaptation

InDomain detection is a tool designed to extract in-domain data from a large collection of data.
GNU General Public License v3.0

Corpus unnecessarily in RAM #30

Closed (kpu closed 4 years ago)

kpu commented 5 years ago

https://github.com/paracrawl/Domain_Adaptation/blob/432916d54f537342bcffb30f4968c2e19a5be98e/scripts/ScorePoolData.py#L153

You don't need the whole corpus in RAM. Stream it instead.
The corpus appears to be loaded whole so it can be handled as XML or something similar, which is overkill for one extra column of data.
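
A minimal sketch of what streaming could look like, assuming the corpus is plain text with one sentence per line and the score is written as an extra tab-separated column. The names `stream_scored` and `score_file` and the `score` callable are hypothetical and are not taken from `ScorePoolData.py`:

```python
import io
from typing import Callable, Iterable, Iterator, TextIO


def stream_scored(lines: Iterable[str], score: Callable[[str], float]) -> Iterator[str]:
    """Yield each input line with its score appended as an extra column.

    `score` is a hypothetical stand-in for whatever scoring the real script does.
    """
    for line in lines:
        text = line.rstrip("\n")
        yield f"{text}\t{score(text):.4f}\n"


def score_file(src: TextIO, dst: TextIO, score: Callable[[str], float]) -> None:
    """Copy src to dst line by line, adding the score column."""
    # Iterating over a file object reads one line at a time, and the
    # generator is consumed lazily, so the corpus is never held in a list.
    dst.writelines(stream_scored(src, score))
```

Called as `score_file(sys.stdin, sys.stdout, my_score)`, this would keep memory use roughly constant regardless of corpus size, since only the current line is held at any time.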