neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Ensure length of sections is less than max nlp length #27

Closed davidmezzetti closed 3 years ago

davidmezzetti commented 3 years ago

Some text sections in CORD-19 are extremely long, often because they contain large RNA sequences stored as text, which won't split into sentences. Add logic to handle these sections so that every section is shorter than the max NLP length.
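
One possible way to handle these sections is sketched below. This is not paperetl's actual implementation; the function name `chunk_section` and the `max_length` parameter are hypothetical, and the sketch assumes the limit is measured in characters. It breaks at whitespace where it can and falls back to a hard cut for unbroken runs such as RNA sequences.

```python
def chunk_section(text, max_length):
    """
    Split text into chunks that are each strictly shorter than max_length characters.

    Hypothetical helper for illustration only. Breaks at the last space inside
    each window when one exists, otherwise cuts at the window edge.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")

    limit = max_length - 1  # longest allowed chunk, strictly below max_length
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + limit, len(text))

        if end < len(text):
            # Prefer a break at the last space in the window, if there is one
            split = text.rfind(" ", start, end)
            if split > start:
                end = split

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        start = end

    return chunks


# Ordinary prose is split at spaces
prose_chunks = chunk_section("the quick brown fox", 11)

# An unbroken sequence has no spaces, so it is cut at fixed offsets
sequence = "ACGU" * 5
sequence_chunks = chunk_section(sequence, 8)
```

Each returned chunk can then be passed to the sentence splitter on its own. A hard cut can land in the middle of a token, which seems acceptable for sequence data but may not suit every section type.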