ncbi-nlp / BioWordVec


Regarding the bio_embedding_intrinsic download file #1

Open coolabhishek opened 4 years ago

coolabhishek commented 4 years ago

I see the following error while loading the model: UnpicklingError: unpickling stack underflow

Looks like this can be because of the old scipy format of the saved file. Is there a way to get the txt file format?

Thanks Abhishek

kaushikacharya commented 4 years ago

@coolabhishek You can use BioWordVec from BioSentVec, which the authors describe as an extension of the work in this repository.

The wiki page explains how to load BioWordVec: https://github.com/ncbi-nlp/BioSentVec/wiki
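For reference, here is a minimal loading sketch in Python. It assumes the downloaded bio_embedding_intrinsic file is in word2vec binary format, as the wiki describes; the query words are just examples:

from gensim.models import KeyedVectors

# Assumption: bio_embedding_intrinsic is a word2vec-format binary file,
# as described on the BioSentVec wiki.
word_vectors = KeyedVectors.load_word2vec_format("bio_embedding_intrinsic", binary=True)

# The vocabulary is lowercased, so query with lowercase words.
print(word_vectors["adrenaline"][:5])
print(word_vectors.most_similar("insulin", topn=3))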

adijad20 commented 4 years ago

Hi,

The paper says that all words are converted to lower case. So if I use the model file to get the word vector for a word that contains capital letters (e.g. Adrenaline), how will the word embedding be computed for such a word, since there will be no n-grams with capital letters? Could you please help me here?

Thanks, Aditya

kaushikacharya commented 4 years ago

@adijad20

how will the word embedding be computed for such a word, since there will be no n-grams with capital letters? Could you please help me here?

Here's my understanding of how the embedding vector would be computed for Adrenaline.

BioWordVec, improving biomedical word embeddings with subword information and MeSH by Zhang et al. (2019) mentions that all words were lowercased:

Implementation details. In our experiments, we downloaded the PubMed XML source files from https://www.nlm.nih.gov/databases/download/pubmed_medline.html. Our PubMed data contains 27,599,238 articles including the titles and abstracts. We extracted the title and abstract texts from the PubMed XML files to construct the PubMed text data. All words were converted to lowercase. The final PubMed text data contain 3,658,450,658 tokens.

Here it mentions how the subword embedding model is used to compute the word embedding:

Subword embedding model. Bojanowski et al. proposed fastText: a subword embedding model based on the skip-gram model that learns the character n-grams distributed embeddings using unlabeled corpora where each word is represented as the sum of the vector representations of its n-grams. Compared to the word2vec model, the subword embedding model can make effective use of the subword information and internal word structure to improve the embedding quality.
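As a quick practical check (a sketch, not from this repository: it assumes the official fasttext Python bindings and the fastText .bin release of BioWordVec distributed via the BioSentVec project), an out-of-vocabulary spelling such as Adrenaline still gets a vector composed from its character n-grams:

import fasttext

# Assumption: the fastText binary released alongside BioSentVec; any fastText .bin model would do.
model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")

# "Adrenaline" is not in the lowercased vocabulary, but fastText still
# builds a vector from the character n-grams it shares with known words.
vec = model.get_word_vector("Adrenaline")
print(vec.shape)  # 200-dimensional for this model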

Now let's first look into FastText's source code:

https://github.com/facebookresearch/fastText/blob/master/python/fasttext_module/fasttext/FastText.py#L120

def get_word_vector(self, word):
      ...
      self.f.getWordVector(b, word)

This calls https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc#L111

void FastText::getWordVector(Vector& vec, const std::string& word) const {
  const std::vector<int32_t>& ngrams = dict_->getSubwords(word);

In this function, Dictionary::getSubwords is called:

https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L91

const std::vector<int32_t> Dictionary::getSubwords(const std::string& word)

This checks whether the word is present in the vocabulary. If it is not present, i.e. it is an out-of-vocabulary (OOV) word, then it calls

https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L172

void Dictionary::computeSubwords(
    const std::string& word,
    std::vector<int32_t>& ngrams,
    std::vector<std::string>* substrings)

This function computes the character n-grams, which are hashed using the Fowler-Noll-Vo hashing function, as mentioned in Enriching Word Vectors with Subword Information by Bojanowski et al. (2017):

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant).

Now let's look into gensim's FastText source code:

Loading the model:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/fasttext.py#L472

self.wv = FastTextKeyedVectors(size, min_n, max_n, bucket, compatible_hash)

Extracting the word vector:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2103

def word_vec(self, word, use_norm=False):

In this function you can see how hashing is applied over the n-grams of an OOV word:

ngram_hashes = ft_ngram_hashes(word, self.min_n, self.max_n, self.bucket, self.compatible_hash)
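To make that concrete, here is a rough Python re-statement of that OOV branch (a sketch against the gensim 3.x API referenced above; ft_ngram_hashes lives in gensim.models.utils_any2vec there and has moved in later releases, and wv is assumed to be the FastTextKeyedVectors of a loaded gensim FastText model):

import numpy as np
from gensim.models.utils_any2vec import ft_ngram_hashes  # gensim 3.x location

def oov_vector(wv, word):
    # Hash the word's character n-grams into buckets and average the
    # corresponding rows of the n-gram matrix, mirroring what
    # FastTextKeyedVectors.word_vec does for an out-of-vocabulary word.
    ngram_hashes = ft_ngram_hashes(word, wv.min_n, wv.max_n, wv.bucket, wv.compatible_hash)
    if not ngram_hashes:
        raise KeyError("cannot compute a vector for %r" % word)
    return np.mean([wv.vectors_ngrams[h] for h in ngram_hashes], axis=0)

# e.g. oov_vector(model.wv, "Adrenaline") for a model loaded with gensim's FastText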

You can also refer to Polm23's answer in https://stackoverflow.com/questions/50828314/how-does-the-gensim-fasttext-pre-trained-model-get-vectors-for-out-of-vocabulary

Importance of upper case in corpus text:

Though all words were lowercased in BioWordVec, there's a discussion thread in fastText where it's mentioned why we might want to keep upper-case characters in the corpus too.

adijad20 commented 4 years ago

@kaushikacharya,

Thanks for your reply. It was really informative.

I still have two questions:

  1. Do you mean to say that because hashing is used, the n-grams containing capital letters (e.g. Adr) which were not present in training will hash to some bucket, and this gives the embedding for that n-gram?
  2. Let's say I have a gene symbol "ghrl" which is present in the vocabulary. The FastText paper says that the n-grams for this word will also include the entire word. So while calculating the embedding for "ghrl", does it sum up the embeddings of all its constituent n-grams, or does it just return the embedding for the whole-word n-gram <ghrl>?

Thanks, Aditya

kaushikacharya commented 4 years ago

@adijad20

Regarding your 1st question

Execute the following script in https://www.onlinegdb.com/online_c++_compiler to see for yourself how the hashing is done for the n-gram Adr:

#include <cstdint>
#include <iostream>
#include <string>

using namespace std;

// hash function copied from https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L163
uint32_t hash_func(const std::string& str) {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(int8_t(str[i]));
    h = h * 16777619;
  }
  return h;
}

int main()
{
    // bucket value taken from https://github.com/facebookresearch/fastText/blob/master/src/args.cc
    int bucket = 2000000;
    int32_t h = hash_func("Adr") % bucket;
    cout << "Hash of sub-string: " << h;

    return 0;
}

I am getting the output as Hash of sub-string: 848830

Regarding your 2nd question:

From the FastText paper:

Each word w is represented as a bag of character n-gram. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams: <wh, whe, her, ere, re> and the special sequence <where>. Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater or equal to 3 and smaller or equal to 6.

Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.

Along with the subword n-grams, it will also create an n-gram for the word itself, provided it is within the maximum n-gram size, i.e. 6. Thanks for correcting me.

In the source code, have a look at void Dictionary::computeSubwords:

https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L172

for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
      ngram.push_back(word[j++]);

For i = 0, this part of the code will create the n-gram for the entire word: <ghrl>

In case you are wondering what (word[j] & 0xC0) == 0x80 does, read paxdiablo's answer on StackOverflow, where the user explains that it is used to identify multi-byte sequences in Unicode (UTF-8).
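As a sanity check, here is a small pure-Python re-statement of the n-gram extraction described above (a sketch for ASCII input only; the UTF-8 handling is omitted). It shows that the full token <ghrl> is itself one of the extracted n-grams:

def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in the boundary symbols < and > and collect all
    # character n-grams with minn <= n <= maxn.
    token = "<" + word + ">"
    ngrams = []
    for i in range(len(token)):
        for n in range(minn, maxn + 1):
            if i + n <= len(token):
                ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams("ghrl"))
# ['<gh', '<ghr', '<ghrl', '<ghrl>', 'ghr', 'ghrl', 'ghrl>', 'hrl', 'hrl>', 'rl>']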

adijad20 commented 4 years ago

Thanks @kaushikacharya,

The first question is clear: the substring Adr would hash into some bucket, and the vector associated with that bucket gives the embedding for the substring.

For the second question, I understood that an n-gram for the entire word "ghrl" would be created, but my question is: when I ask the FastText model for the embedding of "ghrl", will it just return the vector learned for the entire word "ghrl", or will all its constituent n-grams (<gh, ghr, etc.) be added along with <ghrl> to get its embedding? For out-of-vocabulary words, I know that it sums up the embeddings of all the constituent n-grams of the word, but for in-vocabulary words, I wanted to know how it behaves.

kaushikacharya commented 4 years ago

For the second question, I understood that an n-gram for the entire word "ghrl" would be created, but my question is: when I ask the FastText model for the embedding of "ghrl", will it just return the vector learned for the entire word "ghrl", or will all its constituent n-grams (<gh, ghr, etc.) be added along with <ghrl> to get its embedding?

@adijad20 My understanding is that for every word (irrespective of whether it is in the vocabulary or not), FastText computes its vector embedding by adding up the vectors of all the constituent n-grams of the word.
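As far as I can tell from fasttext.cc, getWordVector sums the input vectors of the word id and all its n-gram ids and then divides by their count, so this can be checked directly with the official fasttext Python bindings (a sketch; the model file name is just BioWordVec's release name, and any fastText .bin model would do):

import numpy as np
import fasttext

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")  # assumption: fastText .bin release

# For an in-vocabulary word, get_subwords returns "ghrl" itself plus its character n-grams.
subwords, ids = model.get_subwords("ghrl")
reconstructed = np.mean([model.get_input_vector(i) for i in ids], axis=0)

print(np.allclose(reconstructed, model.get_word_vector("ghrl"), atol=1e-6))  # expected: True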