Missing source PDFs

microsoft / AzureSearch_JFK_Files

This repo contains the sample code of the Azure Search and Cognitive Services used to provide insights and analysis around the JFK Files.

MIT License

389 stars, 225 forks

Missing source PDFs #96

Closed. glennmusa closed this issue 4 years ago

glennmusa commented 4 years ago

Attempting to create the indexer returns an error when pulling PDF sources from the sample's hosted blob storage.

Error creating indexer: Error with data source: Error processing blob 'https://azsjfkfiles.blob.core.windows.net/jfkfiles/castro%20operation/docid-32105760.pdf': No connection could be made because the target machine actively refused it 127.0.0.1:9998  Please adjust your data source definition in order to proceed.

Opening the URL https://azsjfkfiles.blob.core.windows.net/jfkfiles/castro%20operation/docid-32105760.pdf directly returns:

<Error>
  <Code>ResourceNotFound</Code>
  <Message>The specified resource does not exist. RequestId:697c6496-201e-0034-651e-77cba8000000 Time:2020-08-20T18:16:39.9942896Z</Message>
</Error>

Careyjmac commented 4 years ago

Investigating this. It looks like the file accidentally got deleted on our side. Stay tuned.

Careyjmac commented 4 years ago

Apologies for the confusion. The blob didn't actually get deleted; I misunderstood your error and how you were trying to access the blob. The first error is a transient error that sometimes occurs with the indexer. Please try rerunning the indexer after waiting a few minutes, or submit a support case if the error repeats consistently over a longer period of time.

When you go to the PDF URL directly, you get a not-found error because you also need to include the provided SAS token, which grants read access to the sample documents. This link, for example, does work:

https://azsjfkfiles.blob.core.windows.net/jfkfiles/castro%20operation/docid-32105760.pdf?st=2019-05-02T21%3A30%3A00Z&se=9999-12-21T12%3A00%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=NmOktoltvmd9gLqHidRuQN06xeSXbCVYIVti5prLdmA%3D
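For anyone building these links in a script rather than by hand, here is a minimal sketch of appending the SAS query string to a plain blob URL. The helper name `with_sas_token` is hypothetical and not part of the sample repo; it assumes the token is already URL-encoded, as it is in the link above.

```python
from urllib.parse import urlsplit, urlunsplit


def with_sas_token(blob_url: str, sas_token: str) -> str:
    """Append an already URL-encoded SAS query string to a blob URL.

    Hypothetical helper for illustration; the token is passed through
    unchanged, so it must already be percent-encoded.
    """
    token = sas_token.lstrip("?")  # accept the token with or without a leading "?"
    parts = urlsplit(blob_url)
    # Keep any query string the URL already carries and add the token after it.
    query = f"{parts.query}&{token}" if parts.query else token
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# The blob URL and SAS token quoted in this thread.
BLOB_URL = (
    "https://azsjfkfiles.blob.core.windows.net/jfkfiles/"
    "castro%20operation/docid-32105760.pdf"
)
SAS_TOKEN = (
    "?st=2019-05-02T21%3A30%3A00Z&se=9999-12-21T12%3A00%3A00Z"
    "&sp=rl&sv=2018-03-28&sr=c&sig=NmOktoltvmd9gLqHidRuQN06xeSXbCVYIVti5prLdmA%3D"
)

signed_url = with_sas_token(BLOB_URL, SAS_TOKEN)
```

Here `signed_url` reproduces the working link above. The sketch only does string handling; it does not check that the token is valid or unexpired.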

glennmusa commented 4 years ago

Ah, great. Thanks for providing the token-specific URI for finding the document via the browser.

You're also correct that the indexer throws this error on and off when an agent executes it as part of our deployment process.

If we learn more or discover a root cause I'll loop back into this issue. 👍

glennmusa commented 4 years ago

No material findings, but we're seeing that the indexer runs successfully on an immediate second run.
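The retry workaround discussed in this thread could be sketched roughly as follows for a deployment script. This is an illustration under assumptions, not the sample's actual deployment code: `run_with_retry` and the `run_once` callable are hypothetical names, and `run_once` stands in for whatever step creates or runs the indexer and raises on failure.

```python
import time
from typing import Callable


def run_with_retry(
    run_once: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it succeeds; return the number of attempts used.

    run_once is a hypothetical callable that raises when the indexer step
    fails. The last failure is re-raised once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            run_once()
            return attempt
        except Exception:
            # Broad catch for brevity; a real script should catch only the
            # transient error type it expects.
            if attempt == attempts:
                raise
            if delay_seconds > 0:
                sleep(delay_seconds)
    raise AssertionError("unreachable")
```

With `delay_seconds=0` this matches the "immediate second run" observation; a delay of a few minutes matches the earlier advice to wait before rerunning. If the error still repeats after several attempts, the earlier suggestion to open a support case applies.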