pprados / langchain-googledrive

An external version of a pull request for langchain.
Apache License 2.0
26 stars, 10 forks

Support XLSX documents #9

Closed ururk closed 6 months ago

ururk commented 6 months ago

Issue you'd like to raise.

In doing some testing, I'm finding this library skips xlsx files. When I look at the code, no loader is specified here:

https://github.com/pprados/langchain-googledrive/blob/5b82e7355b7de285beb69a0b191913cbbd65f163/langchain_googledrive/utilities/google_drive.py#L298-L314

Suggestion:

Is it possible to import UnstructuredExcelLoader and use that to index xlsx files? It feels like everything is in place to support this file format.

pprados commented 6 months ago

Set the log level to INFO, and you will see which package to install for that.
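A minimal sketch of turning on INFO-level logging before running the loader. It assumes the library logs through Python's standard `logging` module; the logger name `langchain_googledrive` is an assumption based on the package name, not something confirmed in this thread.

```python
import logging

# Show INFO messages (and above) from all loggers on the console.
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s %(name)s: %(message)s",
)

# Assumption: the library's loggers live under the package name.
# Setting the level here also works if the root logger was configured elsewhere.
logger = logging.getLogger("langchain_googledrive")
logger.setLevel(logging.INFO)
```

With this in place, run the loader as usual and look for INFO lines that mention skipped files or missing packages.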

On Fri, Apr 5, 2024 at 11:35, John Pariseau wrote:

Issue you'd like to raise.

In doing some testing, I'm finding this library skips xlsx files. When I look at the code, no loader is specified here:

https://github.com/pprados/langchain-googledrive/blob/5b82e7355b7de285beb69a0b191913cbbd65f163/langchain_googledrive/utilities/google_drive.py#L310

Suggestion:

Is it possible to import UnstructuredExcelLoader and use that to index xlsx files? It feels like everything is in place to support this file format.


ururk commented 6 months ago

> Set the log level to INFO, and you will see which package to install for that.

Doesn't the mime type mapper have to be updated though?

    "application/vnd.openxmlformats-officedocument."
    "spreadsheetml.sheet": cast(
        TYPE_LOAD, partial(UnstructuredExcelLoader, mode=mode)
    ),  # XLSX

pprados commented 6 months ago

Yes.

Read the method default_conv_loader() (https://github.com/pprados/langchain-googledrive/blob/master/langchain_googledrive/utilities/google_drive.py#L166) and update the conv_mapping attribute (https://github.com/pprados/langchain-googledrive/blob/master/langchain_googledrive/utilities/google_drive.py#L566C5-L566C17). You can map a MIME type to whatever you want.
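As a rough illustration of that idea, here is a sketch that adds an XLSX entry to a MIME-type-to-loader mapping. It does not import the library: `StandInExcelLoader` and `add_xlsx_loader` are hypothetical names used only for this example, and the empty `base_mapping` stands in for whatever default_conv_loader() returns. How the real loader factory is called (for instance, with a file path as its first argument) is an assumption here.

```python
from functools import partial
from typing import Any, Callable, Dict, List

# MIME type for .xlsx files, as spelled out in the snippet earlier in this thread.
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


class StandInExcelLoader:
    """Hypothetical stand-in for UnstructuredExcelLoader(file_path, mode=...)."""

    def __init__(self, file_path: str, mode: str = "single") -> None:
        self.file_path = file_path
        self.mode = mode

    def load(self) -> List[str]:
        # A real loader would return Document objects; strings keep this runnable.
        return [f"{self.file_path} loaded in {self.mode} mode"]


def add_xlsx_loader(
    mapping: Dict[str, Callable[..., Any]],
    loader_cls: Callable[..., Any],
    mode: str = "single",
) -> Dict[str, Callable[..., Any]]:
    """Return a copy of the mapping with an XLSX entry added."""
    updated = dict(mapping)  # leave the caller's mapping untouched
    updated[XLSX_MIME] = partial(loader_cls, mode=mode)
    return updated


base_mapping: Dict[str, Callable[..., Any]] = {}  # placeholder for the defaults
mapping = add_xlsx_loader(base_mapping, StandInExcelLoader, mode="elements")

# Assumption: the factory is called with the path of the downloaded file.
loader = mapping[XLSX_MIME]("report.xlsx")
print(loader.load())
```

In real use, the stand-in would be replaced by UnstructuredExcelLoader and the resulting dictionary assigned to conv_mapping, as the comment above suggests.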


pprados commented 6 months ago

Version 0.1.40 handles Excel files. Thanks.