nissl-lab / toxy

.NET text extraction framework
Apache License 2.0


What's Toxy

Toxy is a .NET data/text extraction framework, similar to what Apache Tika is for Java. It supports many popular formats, including docx, xlsx, xls, pdf, csv, txt, epub, and html.


Why Toxy

In the past, we had to use IFilter, a Windows-only technology, to extract text from documents. Toxy is platform independent: it aims to support Linux as well as Windows. It is also easy to use. You don't need to care much about which file extension you are extracting from, because the framework identifies the file format and extracts the data/text into a unified structure.

Toxy Objects