Closed chime3 closed 2 months ago
Instead of downloading the files and parsing them, is it possible to fetch the indexableText
for each file? See https://developers.google.com/drive/api/guides/file#indexable-text
According to ChatGPT:
To fetch the indexable text property of a Google Drive file using the Google Drive API, you'll need to use the files.get method with the fields parameter specifying the indexableText field.
We don't want to store the files, we are just extracting text from documents that have text content.
We can export Google spreadsheets and presentations as PDF or other corresponding types such as ppt, xls, etc. Do you want to process them as PDF for parsing?
Let's not parse spreadsheets. Convert presentations to PDF, then parse them.
doc, xls, ppt, etc. are non-Google documents. Do you want to parse them as well?
Yes to parsing .doc and .ppt.
There are many non-Google document types in addition to DocumentType. Are you planning to extend the supported file types?
Not at this stage.
Should we also list non-document files in Google Drive, such as zip and mp4? That is, will the OTHER type include zip, mp4, etc.?
I think we should list them and include a link to them, but not parse them.
Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text
Extracted indexableText from Google documents/slides/presentations (PUSHED)
For non-Google documents, to avoid downloading we can convert
- PDF/DOC/DOCX => Google document
- PPT/PPTX => Google presentation
- XLS/XLSX => Google spreadsheet
and then get indexableText from the converted documents. But this requires write permission.
// Convert the file (doc/docx, ppt/pptx, xls/xlsx) to the matching Google Docs/Slides/Sheets format
const response = await drive.files.copy({
  fileId: fileId,
  requestBody: {
    mimeType: mimeType, // target Google Docs/Slides/Sheets MIME type
  },
});
const documentId = response.data.id; // ID of the newly created (converted) file
// Process it like a native Google document
Otherwise, we need to download non-Google documents to parse their content.
Please confirm which option we should proceed with, @tahpot
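For the conversion option, the target type passed as mimeType could be chosen from the file extension. The following is only a sketch, assuming the pairs discussed in this thread (PDF/DOC/DOCX to Google document, PPT/PPTX to Google presentation, XLS/XLSX to Google spreadsheet). The helper name conversionTargetFor is hypothetical; the MIME type strings are the ones the Drive API uses for Google-native files.

```typescript
// Google-native MIME types as defined by the Drive API.
const GOOGLE_DOC = "application/vnd.google-apps.document";
const GOOGLE_SLIDES = "application/vnd.google-apps.presentation";
const GOOGLE_SHEET = "application/vnd.google-apps.spreadsheet";

// Extension -> conversion target, following the pairs listed in the thread.
const CONVERSION_TARGETS = new Map<string, string>([
  ["pdf", GOOGLE_DOC],
  ["doc", GOOGLE_DOC],
  ["docx", GOOGLE_DOC],
  ["ppt", GOOGLE_SLIDES],
  ["pptx", GOOGLE_SLIDES],
  ["xls", GOOGLE_SHEET],
  ["xlsx", GOOGLE_SHEET],
]);

// Hypothetical helper: returns the Google MIME type to convert to,
// or undefined when the extension is missing or not in the table.
function conversionTargetFor(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return CONVERSION_TARGETS.get(extension);
}
```

The returned value would be passed as requestBody.mimeType in the copy call; a result of undefined means the file is not a conversion candidate.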
We don't want to store the files, we are just extracting text from documents that have text content.
Yeah, we are only extracting text now, not storing files.
Let's not parse spreadsheets. Convert presentations to PDF, then parse them.
We can get indexableText from Google spreadsheets now. Do you still not want to parse spreadsheets?
Yes to parsing .doc and .ppt.
Okay
Not at this stage.
Okay
I think we should list them and include a link to them, but not parse them.
Okay
We can get indexableText from Google spreadsheets now. Do you still not want to parse spreadsheets?
Let's use indexableText for any file that supports it, so yes to spreadsheets.
Instead of downloading the files and parsing them, is it possible to fetch the indexableText for each file? See https://developers.google.com/drive/api/guides/file#indexable-text
Extracted indexableText from Google documents/slides/presentations (PUSHED)
For non-Google documents, to avoid downloading we can convert
- PDF/DOC/DOCX => Google document
- PPT/PPTX => Google presentation
- XLS/XLSX => Google spreadsheet
and then get indexableText from the converted documents. But this requires write permission.
Let's download and parse. Add an option in the handler for maxFileSizeToIndex with default = 10MB.
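A minimal sketch of how such a limit might be checked before downloading. Only the option name maxFileSizeToIndex and its 10MB default come from the request above; the HandlerOptions shape and the shouldIndexFile helper are hypothetical, and reading 10MB as 10 * 1024 * 1024 bytes is an assumption. The Drive API reports the size field as a string of bytes, so the sketch parses it.

```typescript
// Hypothetical option shape; only maxFileSizeToIndex is named in the thread.
interface HandlerOptions {
  maxFileSizeToIndex?: number; // limit in bytes
}

// Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
const DEFAULT_MAX_FILE_SIZE_TO_INDEX = 10 * 1024 * 1024;

// Returns true when the file may be downloaded and parsed.
// `size` is the byte count as a string, as Drive file metadata reports it;
// it may be absent, in which case this sketch lets the file through.
function shouldIndexFile(size: string | undefined, options: HandlerOptions = {}): boolean {
  const limit = options.maxFileSizeToIndex ?? DEFAULT_MAX_FILE_SIZE_TO_INDEX;
  if (size === undefined) return true;
  const bytes = Number(size);
  // Unparseable sizes are rejected rather than guessed at.
  return isFinite(bytes) && bytes <= limit;
}
```

Files over the limit would still be listed with a link, in line with the earlier decision to list but not parse unsupported files.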
One more change.
Please switch from using the document schema to using this new file schema: https://common.schemas.verida.io/file/v0.1.0/schema.json
For Google documents, fileDataId should always be undefined, but uri should always be defined.
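To make the rule concrete, here is a sketch under stated assumptions. FileRecord is a hypothetical, trimmed-down shape and not the actual file schema; only the field names fileDataId and uri come from the comment above. Using a Drive web link (for example the webViewLink metadata field) as the uri is also an assumption.

```typescript
// Hypothetical subset of a file record. Only `fileDataId` and `uri`
// are named in the thread; `name` and `mimeType` are illustrative.
interface FileRecord {
  name: string;
  mimeType: string;
  uri?: string;
  fileDataId?: string;
}

// Builds a record for a Google-native document: the content stays in
// Google Drive, so no file data is stored and a link is mandatory.
function buildGoogleDocumentRecord(name: string, mimeType: string, link: string): FileRecord {
  if (!link) {
    throw new Error("uri must be defined for Google documents");
  }
  return { name, mimeType, uri: link, fileDataId: undefined };
}
```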
Updated the schema, and the unit tests pass. Also checked the Google Drive data using the web interface.
One more thing about extracting the extension, @tahpot: the mimetype is used to determine the extension. We cannot recognize all file extensions. Please clarify whether we should use type constraints or not.
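One possible shape for this lookup, sketched under assumptions: the table below is an assumed starting set (the thread does not say which types are covered), and extensionFromMimeType is a hypothetical helper name. Unrecognized MIME types return undefined instead of failing, so the open question about type constraints is left to the caller.

```typescript
// Assumed starting table of standard MIME types and their extensions.
const EXTENSION_BY_MIME_TYPE = new Map<string, string>([
  ["application/pdf", "pdf"],
  ["text/plain", "txt"],
  ["application/msword", "doc"],
  ["application/vnd.openxmlformats-officedocument.wordprocessingml.document", "docx"],
  ["application/vnd.ms-powerpoint", "ppt"],
  ["application/vnd.openxmlformats-officedocument.presentationml.presentation", "pptx"],
  ["application/vnd.ms-excel", "xls"],
  ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "xlsx"],
]);

// Hypothetical helper: returns the extension for a known MIME type,
// or undefined when the type is not in the table.
function extensionFromMimeType(mimeType: string): string | undefined {
  return EXTENSION_BY_MIME_TYPE.get(mimeType.trim().toLowerCase());
}
```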
pptx-parser with officeparser
doc and ppt: those types are deprecated in recent libraries, so we would need libreoffice-convert to convert doc/ppt into docx/pptx before using current packages, but libreoffice-convert requires LibreOffice to be preinstalled. So parsing of doc and ppt is skipped.
Feedback:
- Can confirm tests pass
- There was an issue with the Google Slides API not being enabled, so I have enabled it. Will we have the same issue with other types like drawings or sheets?
- When running the sync on my Gmail, I receive this error in the sync logs and no files are saved (Export only supports Docs Editors files.). This Stack Overflow post indicates files.get() may need to be used instead?
- The failure of one item shouldn't break the whole sync process, so I'm not sure what is happening to prevent any items being saved.
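The last two points can be illustrated together. This is a sketch, not the handler's actual code: fetchMethodFor and syncEach are hypothetical names, and the set of exportable types is an assumption about which Google-native types the handler cares about. It relies on the behaviour the error message describes: the Drive export endpoint only accepts Docs Editors files, while other files are downloaded through files.get with alt=media.

```typescript
type FetchMethod = "export" | "download" | "skip";

// Assumed set of Google-native types that the handler would export.
const EXPORTABLE_TYPES = new Set<string>([
  "application/vnd.google-apps.document",
  "application/vnd.google-apps.spreadsheet",
  "application/vnd.google-apps.presentation",
  "application/vnd.google-apps.drawing",
]);

// Picks how to obtain content for a file based on its MIME type.
function fetchMethodFor(mimeType: string): FetchMethod {
  if (EXPORTABLE_TYPES.has(mimeType)) return "export";
  // Other Google-native entries (folders, shortcuts, ...) have no content to fetch here.
  if (mimeType.startsWith("application/vnd.google-apps.")) return "skip";
  return "download"; // binary files: files.get with alt=media
}

interface SyncOutcome<R> {
  saved: R[];
  failed: { id: string; reason: string }[];
}

// Processes items one by one; a failing item is recorded and the loop continues.
async function syncEach<T extends { id: string }, R>(
  items: T[],
  handler: (item: T) => Promise<R>,
): Promise<SyncOutcome<R>> {
  const outcome: SyncOutcome<R> = { saved: [], failed: [] };
  for (const item of items) {
    try {
      outcome.saved.push(await handler(item));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      outcome.failed.push({ id: item.id, reason });
    }
  }
  return outcome;
}
```

With this shape, a single export error ends up in failed with its message, and the remaining items are still saved.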
All done
All good to merge once the handler is updated to handle the last page fix in the other google handlers.
Okay, thank you
Closes #85
What's changed
- mimeType for pdf and Google documents (doc, spreadsheet, etc.)
- DocumentType matching mimeType
- pdf and Google document only (excluding spreadsheet and presentation)
How to test these changes
yarn run test tests/provider/google/gdrive-document.tests.ts
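What's pending
- We can export Google spreadsheets and presentations as pdf or other corresponding types like ppt, xls, etc. Do you want to process them as pdf to parse?
- doc, xls, ppt, etc. are non-Google documents. Do you want to parse them as well?
- There are many non-Google document types in addition to DocumentType. Are you planning to extend supported file types?
- Should we list non-document files in Google Drive like zip, mp4 as well? I mean, will the OTHER type include zip, mp4, etc.?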