jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0

DOCQA.. #3882

Closed jyotikhetan closed 2 years ago

jyotikhetan commented 2 years ago

I am trying to parse a PDF and apply a DocQA model to it (RoBERTa as well as BERT), but I am not getting the right answer, whereas the same models work accurately when used independently.

This is my flow.yml

jtype: Flow
version: '1'
with:
  workspace: $JINA_WORKSPACE
  port_expose: $JINA_PORT
executors:
  - name: transformer
    uses: 'jinahub://TransformerTorchEncoder/v0.1'
    uses_with:
      device: 'cuda'
  - name: indexer
    uses: 'jinahub://SimpleIndexer/old'

This is my query.yml

jtype: Flow
version: '1'
with:
  workspace: $JINA_WORKSPACE
  port_expose: $JINA_PORT
executors:
  - name: transformer
    uses: 'jinahub://TransformerTorchEncoder/v0.1'
    uses_with:
      device: 'cuda'
  - name: indexer
    uses: 'jinahub://SimpleIndexer/old'
  - name: generator
    uses: Generator
    py_modules: "flows/generator_roberta.py"

JoanFM commented 2 years ago

Hey @jyotikhetan ,

Could you please provide more details about the code in flows/generator_roberta.py? Maybe also share some of the input data, how you run the Flow, and the query with which you are having problems?

jyotikhetan commented 2 years ago

### generator_roberta.py

from jina import Executor, Document, DocumentArray, requests
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    pipeline,
)

class Generator(Executor):
    # the extractive QA model is loaded once when the class is created
    answer_model_name = "deepset/bert-large-uncased-whole-word-masking-squad2"
    answer_model = AutoModelForQuestionAnswering.from_pretrained(answer_model_name)
    answer_tokenizer = AutoTokenizer.from_pretrained(answer_model_name)
    nlp = pipeline(
        "question-answering", model=answer_model, tokenizer=answer_tokenizer
    )

    @requests
    def generate(self, docs: DocumentArray, **kwargs) -> DocumentArray:
        answers = DocumentArray()
        for doc in docs.traverse_flat(('r',)):
            # concatenate the retrieved matches into a single context string
            context = " ".join([match.text for match in doc.matches])
            qa_input = {"question": doc.text, "context": context}
            result = self.nlp(qa_input)
            # store the answer dict in tags so the client can read doc.tags['answer']
            answers.append(Document(tags=result))
        return answers

### my execution file

import os
import sys
from typing import Iterator

import click
from jina import Flow, Document, DocumentArray
import logging
from pdf_segment import PDFSegmenter

MAX_DOCS = int(os.environ.get("JINA_MAX_DOCS", 0))
cur_dir = os.path.dirname(os.path.abspath(__file__))

def pdf_process():
    pdf_name = ".../1706.03762.pdf"
    segmentor = PDFSegmenter(pdf_name)
    segmentor.text_crafter(save_file_name="data")
    segmentor.image_crafter()

def config(dataset: str = "star-wars") -> None:
    if dataset == "star-wars":
        os.environ["JINA_DATA_FILE"] = os.environ.get("JINA_DATA_FILE", "...../jina-text/data.txt")
    os.environ.setdefault('JINA_WORKSPACE', os.path.join(cur_dir, 'workspace'))
    os.environ.setdefault(
        'JINA_WORKSPACE_MOUNT',
        f'{os.environ.get("JINA_WORKSPACE")}:/workspace/workspace')
    os.environ.setdefault('JINA_LOG_LEVEL', 'INFO')
    os.environ.setdefault('JINA_PORT', str(45678))

def input_generator(file_path: str, num_docs: int) -> Iterator[Document]:
    with open(file_path) as file:
        lines = file.readlines()
    num_lines = len(lines)
    if num_docs:
        for i in range(min(num_docs, num_lines)):
            yield Document(text=lines[i])
    else:
        for i in range(num_lines):
            yield Document(text=lines[i])

def index(num_docs: int) -> None:
    flow = Flow().load_config('flows/flow-index.yml')
    data_path = os.path.join(os.path.dirname(__file__), os.environ.get("JINA_DATA_FILE", None))
    with flow:
        flow.post(on="/index", inputs=input_generator(data_path, num_docs), show_progress=True)

def query(top_k: int) -> None:
    flow = Flow().load_config('flows/flow-query.yml')
    with flow:
        text = input('Please type a question: ')
        doc = Document(content=text)

        result = flow.post(on='/search', inputs=DocumentArray([doc]),
                           # parameters={'top_k': top_k},
                           line_format='text',
                           return_results=True,
                           )
        for doc in result[0].data.docs:
            print(f"\n\nAnswer: {doc.tags['answer']}")

@click.command()
@click.option(
    '--task',
    '-t',
    type=click.Choice(['index', 'query', 'pdf_process'], case_sensitive=False),
)
@click.option('--num_docs', '-n', default=MAX_DOCS)
@click.option('--top_k', '-k', default=5)
@click.option('--data_set', '-d', type=click.Choice(['star-wars']), default='star-wars')
def main(task: str, num_docs: int, top_k: int, data_set: str) -> None:
    config()
    workspace = os.environ['JINA_WORKSPACE']
    logger = logging.getLogger('star-wars-qa')

    if 'index' in task:
        if os.path.exists(workspace):
            logger.error(
                f'\n +------------------------------------------------------------------------------------+ \
                    \n |                                   🤖🤖🤖                                           | \
                    \n | The directory {workspace} already exists. Please remove it before indexing again.  | \
                    \n |                                   🤖🤖🤖                                           | \
                    \n +------------------------------------------------------------------------------------+'
            )
            sys.exit(1)

    if 'query' in task:
        if not os.path.exists(workspace):
            logger.info(f"The directory {workspace} does not exist. Running indexing...")
            index(num_docs)

    if 'pdf_process' in task:
        if not os.path.exists(workspace):
            logger.info(f"The directory {workspace} does not exist. Running indexing...")
            index(num_docs)

    if task == 'index':
        index(num_docs)
    elif task == 'query':
        # query()
        query(top_k)
    elif task == "pdf_process":
        pdf_process()

if __name__ == '__main__':
    main()

The dataset I am trying this on is the research paper "Attention Is All You Need" (PDF), which I parsed and saved as text:

Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring significantly
less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-
to-German translation task, improving over the existing best results, including
ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task,
our model establishes a new single-model state-of-the-art BLEU score of 41.8 after
training for 3.5 days on eight GPUs, a small fraction of the training costs of the
best models from the literature. We show that the Transformer generalizes well to
other tasks by applying it successfully to English constituency parsing both with
large and limited training data.
1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building
block, computing hidden representations in parallel for all input and output positions. In these models,
the number of operations required to relate signals from two arbitrary input or output positions grows
in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes
it more difficult to learn dependencies between distant positions [12]. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions
of a single sequence in order to compute a representation of the sequence. Self-attention has been
used successfully in a variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-
aligned recurrence and have been shown to perform well on simple-language question answering and
language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying
entirely on self-attention to compute representations of its input and output without using sequence-
aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate
self-attention and discuss its advantages over models such as [17, 18] and [9].
3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence
of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output
sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive
[10], consuming the previously generated symbols as additional input when generating the next.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
Figure 1: The Transformer - model architecture.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.

### Query I asked

On what kind of dataset model was trained..??

JoanFM commented 2 years ago

May I ask what the expected outcome is, and what pipeline you follow when working without Jina to extract the expected result?

The first thing I would check is how many lines are being indexed. This seems like a non-optimal way of chunking your document; see the sketch below.
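
For example (just a rough sketch, not a drop-in fix; the file path is a placeholder), you could yield one Document per paragraph instead of one per line:

from typing import Iterator
from jina import Document

def paragraph_generator(file_path: str) -> Iterator[Document]:
    # Read the whole parsed text and split it on blank lines, so every
    # Document carries a full paragraph instead of a single wrapped line.
    with open(file_path) as f:
        text = f.read()
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph:  # skip empty fragments left over from PDF extraction
            yield Document(text=paragraph)

That way each indexed Document carries a bit more context for the downstream QA step.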

jyotikhetan commented 2 years ago

So the query I asked was 'On what kind of dataset model was trained?' and the expected answer was 'WMT 2014 English-German'.

This is the plain pipeline I used:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

JoanFM commented 2 years ago

So what was the result from this pipeline, and what is the result using Jina?

jyotikhetan commented 2 years ago

'WMT 2014 English-German' is the result I am getting using the plain pipeline; using Jina I am getting 'Answer: models'.

As you mentioned, I shall check the number of lines indexed.

jyotikhetan commented 2 years ago

So I checked my indexing as well: I indexed just 2 paragraphs and checked the answer... still no luck! I am only getting a different wrong answer.

Training This section describes the training regime for our models. 5.1 Training Data and Batching model is trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source- target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens. 5.2 Hardware and Schedule We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days). 5.3 Optimizer We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ϵ = 10−9. We varied the learning rate over the course of training, according to the formula: lrate = d−0.5 model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)

The answer I am getting is: Answer: 8 NVIDIA P100 GPUs

JoanFM commented 2 years ago

> So I checked my indexing as well: I indexed just 2 paragraphs and checked the answer... still no luck! I am only getting a different wrong answer.
>
> Training This section describes the training regime for our models. 5.1 Training Data and Batching model is trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source- target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens. 5.2 Hardware and Schedule We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days). 5.3 Optimizer We used the Adam optimizer [20] with β1 = 0.9, β2 = 0.98 and ϵ = 10−9. We varied the learning rate over the course of training, according to the formula: lrate = d−0.5 model · min(step_num−0.5, step_num · warmup_steps−1.5) (3)
>
> The answer I am getting is: Answer: 8 NVIDIA P100 GPUs

It is very hard to understand what granularity you want to index. Shouldn't you try to break your text into smaller pieces so that the model works? How do you expect it to work exactly like your plain implementation if the data you provide to the model is not the same?

JoanFM commented 2 years ago

Also, how do you expect to have the same output as in https://github.com/jina-ai/jina/issues/3882#issuecomment-962897278 if the sentence 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.' is not part of the index?

jyotikhetan commented 2 years ago

That was just a reference for the model, which I had tested on the Hugging Face UI. I then provided Jina with the exact same dataset that I provided to the FARM model via Haystack, where I am getting the right answer. That's why I am not able to figure out where I am going wrong here. I also checked the length of the input document. Please help...

JoanFM commented 2 years ago

What does providing the same input mean?

When you do:

def input_generator(file_path: str, num_docs: int) -> Iterator[Document]:
    with open(file_path) as file:
        lines = file.readlines()
    num_lines = len(lines)
    if num_docs:
        for i in range(min(num_docs, num_lines)):
            yield Document(text=lines[i])
    else:
        for i in range(num_lines):
            yield Document(text=lines[i])

what text is added to each of the Documents? How many of these Documents are created? (See the sketch below for a quick way to check.)

Can you please describe in detail what input you want to provide to the model and what you expect to extract? Also, please clarify what it means to you to provide the same input to the FARM model.
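
For example, reusing the input_generator above (just a sketch; the data path is a placeholder), you could print how many Documents are created and what each one actually contains:

from itertools import islice

# Materialize the generator to count the Documents and preview the first few.
docs = list(input_generator("data.txt", num_docs=0))
print(f"{len(docs)} Documents created")
for doc in islice(docs, 3):
    print(repr(doc.text[:100]))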

jyotikhetan commented 2 years ago

This is the paper I provided, which is a PDF: "Attention Is All You Need", https://arxiv.org/pdf/1706.03762.pdf. Using PDFSegmenter I extract the text into a .txt file. As of now I am using a single document, so I extract one PDF and dump the data into one text file.

I used this search engine, https://haystack.deepset.ai/, provided the same PDF, used the same RoBERTa model, and got the right answer.

I am not able to figure out where I am making a mistake in Jina.

JoanFM commented 2 years ago

Can you show the exact way you provided the same PDF?

jyotikhetan commented 2 years ago

from haystack.document_store import InMemoryDocumentStore, SQLDocumentStore
from haystack.reader import FARMReader, TransformersReader
from haystack.retriever.sparse import TfidfRetriever
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http

def tutorial3_basic_qa_pipeline_without_elasticsearch():
    document_store = InMemoryDocumentStore()

    from haystack.file_converter.pdf import PDFToTextConverter

    converter = PDFToTextConverter(remove_numeric_tables=True)
    dicts = converter.convert(file_path='/home/jyoti/hay_afterupadate/1706.03762/1706.03762.pdf', meta=None)
    dicts = convert_files_to_dicts(dir_path="/home/jyoti/hay_afterupadate/1706.03762/", split_paragraphs=True)
    document_store.write_documents(dicts)
    retriever = TfidfRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

    from haystack.pipeline import ExtractiveQAPipeline
    pipe = ExtractiveQAPipeline(reader, retriever)
    prediction = pipe.run(
        query="on what dataset model was trained?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
    )

    print(prediction)

if __name__ == "__main__":
    tutorial3_basic_qa_pipeline_without_elasticsearch()

This is the pipeline where I used the same PDF.

JoanFM commented 2 years ago

Would it be okay to send the PDF file, if there is no sensitive data in it?

jyotikhetan commented 2 years ago

I have sent the link... it's the research paper "Attention Is All You Need", 1706.03762.pdf.

JoanFM commented 2 years ago

OK, now I understand what you are saying.

I have seen two major things.

Something happens when you load the PDF into the input generator.

The sentence 'We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs' is the one from which the QA model extracts the response. This sentence is not properly added to the index.

If you check your generated Documents, you should make sure that this sentence is properly parsed; see the sketch below for one way to check.
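
For example (a minimal sketch; the file path is a placeholder for your parsed .txt), you could verify that the answer sentence survives the line-based parsing:

target = "standard WMT 2014 English-German dataset"

# Read the parsed text the same way the input generator does (line by line)
# and check whether the sentence containing the answer is still intact.
with open("data.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

hits = [line for line in lines if target in line]
print(f"{len(lines)} non-empty lines, {len(hits)} contain the target sentence")
for line in hits:
    print(line[:120], "...")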

Another potential source of problems is that the Haystack pipeline uses a TfidfRetriever to retrieve the pool of candidates from which the answer is extracted, whereas in the Jina pipeline you are using a Transformer model to retrieve the candidates, so the two pipelines cannot be equivalent.

So, summarizing: make sure the relevant sentence is parsed and indexed properly, and keep in mind that the two retrieval approaches will not return the same candidates.

jyotikhetan commented 2 years ago

Thank you for clarifying it! As you mentioned, it was an indexing issue at my end, and I solved it. However, in my execution file, while running the query task, I have to limit the length of the document, otherwise I get this error:

    indexer@14429[E]:ValueError('cannot reshape array of size 425472 into shape (568,768)')

add "--quiet-error" to suppress the exception details

Please help. Thank you in advance.

JoanFM commented 2 years ago

Hey @jyotikhetan,

Could you set the environment variable JINA_LOG_LEVEL to DEBUG and then share all the logs?

Have you also made sure that the workspace is cleared before running the example?

JoanFM commented 2 years ago

Also, what do you mean by limiting the size of the document?

jyotikhetan commented 2 years ago

log.txt: this is the log file.

And yes, I have deleted my previous workspace.

Limiting the size means: I took one PDF, parsed it, saved the text in a .txt file, then indexed it, and when I queried a question it threw the error above.

Whereas when I reduced the size (i.e. I kept only 2 paragraphs in the .txt), indexed it, and then asked the query, it worked!

Thank you

JoanFM commented 2 years ago

What is the exact Jina version you are working with?

jyotikhetan commented 2 years ago

jina --version gives 2.1.5. I upgraded to the latest version and tried again, but then my encoder was throwing an error that DocumentArray doesn't have 'batch'.

JoanFM commented 2 years ago

@jyotikhetan, maybe you can share the resulting index here. You can zip the workspace folder and attach it here so that we can test it easily. Also, please share the exact query you were trying.

As for the new version, it is a known problem that we are fixing in new Executor versions.

jyotikhetan commented 2 years ago

workspace.zip: this is my index (the workspace folder). My query: I asked the question 'On what dataset model was trained?'

Okay, so I have to chunk the document into smaller parts and then pass them to the query part... is that it?

JoanFM commented 2 years ago

Hello @jyotikhetan ,

What I found is that there are elements in the index that do not have a valid embedding. It looks like you are storing Documents with empty text, which leads to an invalid embedding, and therefore our SimpleIndexer fails at building the matrix for the cosine similarity. This is also consistent with the earlier reshape error: 425472 / 768 = 554 rows of valid embeddings, while the indexer expected 568.

I would suggest that you add an extra Pod to the Flow at index time that filters out invalid Documents before indexing them.

Something like this:

from jina import Executor, Document, DocumentArray, requests

class Filter(Executor):

    @requests
    def filter(self, docs: DocumentArray, **kwargs) -> DocumentArray:
        filtered_docs = DocumentArray([doc for doc in docs if doc.embedding is not None])
        return filtered_docs
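
To wire it in at index time, the Filter would sit between the encoder and the indexer, for example like this (a sketch in Python, assuming the Filter class above is defined in the same script or imported; you could equally register it in flow-index.yml the same way you register the Generator):

from jina import Flow

# Index Flow with the Filter placed right after the encoder, so Documents
# that did not receive a valid embedding never reach the SimpleIndexer.
f = (
    Flow()
    .add(name='transformer', uses='jinahub://TransformerTorchEncoder/v0.1')
    .add(name='filter', uses=Filter)
    .add(name='indexer', uses='jinahub://SimpleIndexer/old')
)
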
JoanFM commented 2 years ago

Hey @jyotikhetan ,

Was this useful for solving your problem?

jyotikhetan commented 2 years ago

Hi @JoanFM, yes it did! I was testing whether I can make it work on different PDFs by indexing them together... now I am able to do that as well! Thank you so much for your help!

JoanFM commented 2 years ago

Very happy to hear that, @jyotikhetan. Please do not hesitate to open another issue in case you need anything. I am going to close this one!