langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

xlsx can recognize the same amount of data exported from the knowledge base, but csv cannot #5910

Closed. JayHsu5 closed this issue 2 days ago.

JayHsu5 commented 1 month ago

Dify version

0.6.11

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create a knowledge base.
  2. Import a CSV file.
  3. Copy the same CSV data into an XLSX file and import it.

✔️ Expected Behavior

Characters are counted for both the CSV file and the XLSX file.

❌ Actual Behavior

Characters are counted for the XLSX file but not for the CSV file.

dosubot[bot] commented 1 month ago

In Dify 0.6.11, the character-count mismatch between CSV and XLSX files may come from differences in the ETL (Extract, Transform, Load) solutions used to parse these formats. Different ETL solutions can extract the same content differently, which leads to inconsistent character counts between CSV and XLSX files [1].

To address this, you can try the following steps:

  1. Preprocessing CSV Files: Ensure that the CSV file is appropriately segmented and cleaned before importing. This includes:

    • Replacing consecutive spaces, newlines, and tabs.
    • Deleting all URLs and email addresses.

  2. Customizing Segmentation Rules: Adjust the segmentation rules in Dify ETL to better handle CSV files. This includes setting segmentation identifiers, maximum segment length, and segment overlap length to maintain semantic relevance.

  3. Encoding Handling: Both Dify ETL and Unstructured ETL use autodetect_encoding=True for CSV files, which attempts to detect the encoding automatically. Verify that the encoding of your CSV file is detected and handled correctly.
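The three steps above can be sketched in plain Python to test a CSV file locally before importing it. This is only an illustration under stated assumptions, not Dify's or Unstructured's actual implementation: the regular expressions, the function names (`clean_text`, `segment`, `decode_with_fallback`), the default lengths, and the list of candidate encodings are all hypothetical.

```python
import re

# Hypothetical patterns; the cleaning rules Dify applies may differ.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_text(text: str) -> str:
    """Delete URLs and email addresses, then collapse repeated whitespace."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)       # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n[ \n]*", "\n", text)  # runs of newlines -> one newline
    return text.strip()


def segment(text: str, separator: str = "\n",
            max_len: int = 500, overlap: int = 50) -> list[str]:
    """Split on a separator; cut long pieces into overlapping windows."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break  # the last window already reached the end of the piece
    return chunks


def decode_with_fallback(raw: bytes,
                         encodings=("utf-8-sig", "gb18030")) -> tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte, so it never fails, but it may mis-render text.
    return raw.decode("latin-1"), "latin-1"
```

If `decode_with_fallback` reports an unexpected encoding for the CSV file, or `len(clean_text(...))` is 0 for a file that clearly has content, that points at the decoding or cleaning stage rather than at segmentation.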

For more detailed information on Unstructured ETL’s data processing methods, refer to the official documentation [1].

If the issue persists, consider checking for any updates or fixes in the latest version of Dify or reviewing the error logs for more insights.

JohnJyong commented 1 month ago

Please attach a screenshot of this issue, @JayHsu5.

JayHsu5 commented 1 month ago

image

JohnJyong commented 1 month ago

If it is convenient, please upload a desensitized version of your CSV file. From your screenshot and description alone, I can't figure out what the problem is. Thanks, @JayHsu5.
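One possible way to produce such a desensitized file is sketched below. It is a hypothetical helper, not part of Dify: `mask_cell` and `desensitize_csv` are made-up names. It replaces every letter with `x` and every digit with `0` while keeping row and column structure, punctuation, and the number of characters per cell, so the file should still exercise the same parsing path. The header row is masked as well.

```python
import csv
import io
import unicodedata


def mask_cell(value: str) -> str:
    """Replace letters with 'x' and digits with '0'; keep everything else."""
    out = []
    for ch in value:
        category = unicodedata.category(ch)
        if category.startswith("L"):    # any letter, including CJK characters
            out.append("x")
        elif category.startswith("N"):  # any numeric character
            out.append("0")
        else:                           # punctuation, spaces, symbols
            out.append(ch)
    return "".join(out)


def desensitize_csv(text: str) -> str:
    """Mask every cell of a CSV document, preserving its row/column layout."""
    reader = csv.reader(io.StringIO(text, newline=""))
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    for row in reader:
        writer.writerow([mask_cell(cell) for cell in row])
    return buffer.getvalue()
```

Note that masking preserves character counts but not byte counts: a multi-byte character becomes a single-byte `x`, so an encoding-related problem may not reproduce with the masked file.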